arXiv:1508.00945v4 [stat.ML] 4 Jun 2016 


Structured Prediction: From Gaussian Perturbations 
to Linear-Time Principled Algorithms 


Jean Honorio 
CS, Purdue 

West Lafayette, IN 47907, USA 
jhonorio@purdue.edu


Tommi Jaakkola 
CSAIL, MIT 

Cambridge, MA 02139, USA 
tommi@csail.mit.edu


Abstract 

Margin-based structured prediction commonly uses a maximum loss over all possible structured outputs [1, 4, 18]. In natural language processing, recent work [20, 21] has proposed the use of the maximum loss over random structured outputs sampled independently from some proposal distribution. This method is linear-time in the number of random structured outputs and trivially parallelizable. We study this family of loss functions in the PAC-Bayes framework under Gaussian perturbations [12]. Under some technical conditions and up to statistical accuracy, we show that this family of loss functions produces a tighter upper bound of the Gibbs decoder distortion than commonly used methods. Thus, using the maximum loss over random structured outputs is a principled way of learning the parameter of structured prediction models. Besides explaining the experimental success of [20, 21], our theoretical results show that more general techniques are possible.


1 Introduction 

Structured prediction has been shown to be useful in many diverse domains. Application areas include natural language processing (e.g., named entity recognition, part-of-speech tagging, dependency parsing), computer vision (e.g., image segmentation, multiple object tracking), speech (e.g., text-to-speech mapping) and computational biology (e.g., protein structure prediction).

In dependency parsing, for instance, the observed input 
is a sentence and the desired structured output is a parse 
tree for the given sentence. 

In general, structured prediction can be viewed as a kind of decoding. A decoder is a machine for predicting the structured output $y$ given the observed input $x$. Such a decoder depends on a parameter $w$. Given a fixed $w$, the task performed by the decoder is called inference. In this paper, we focus on the problem of learning the parameter $w$. Next, we introduce the problem and our main contributions.

We assume a distribution $D$ on pairs $(x, y)$ where $x \in \mathcal{X}$ is the observed input and $y \in \mathcal{Y}$ is the latent structured output, i.e., $(x, y) \sim D$. We also assume that we have a training set $S$ of $n$ i.i.d. samples drawn from the distribution $D$, i.e., $S \sim D^n$, and thus $|S| = n$.

We let $\mathcal{Y}(x) \ne \emptyset$ denote the countable set of feasible decodings of $x$. In general, $|\mathcal{Y}(x)|$ is exponential with respect to the input size.

We assume a fixed mapping $\phi$ from pairs to feature vectors, i.e., for any pair $(x, y)$ we have the feature vector $\phi(x, y) \in \mathbb{R}^k \setminus \{0\}$. For a parameter $w \in \mathcal{W} \subseteq \mathbb{R}^k \setminus \{0\}$, we consider linear decoders of the form:

$$f_w(x) = \operatorname*{argmax}_{y \in \mathcal{Y}(x)} \phi(x, y) \cdot w \tag{1}$$

In practice, very few cases of the above general inference problem are tractable, while most are NP-hard and also hard to approximate within a fixed factor. (We defer the details in theory of computation to Section 6.)
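As a minimal sketch of the linear decoder in eq.(1), the snippet below performs exhaustive search over a tiny, explicitly enumerated feasible set. The feature map `phi`, the helper `dot` and the toy inputs are hypothetical illustrations, not taken from the paper; exhaustive search is only viable here because the feasible set has six elements.

```python
from itertools import combinations

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x, y):
    """Hypothetical feature map: element i contributes x[i] when i is in y."""
    return [x[i] if i in y else 0 for i in range(len(x))]

def decode(x, w, feasible):
    """Eq. (1) by exhaustive search over an explicitly listed feasible set."""
    return max(feasible, key=lambda y: dot(phi(x, y), w))

# Toy feasible set: all subsets of size 2 chosen from 4 elements.
feasible = [frozenset(c) for c in combinations(range(4), 2)]
x = [1, 1, 1, 1]
w = [0.5, -1.0, 2.0, 0.1]
best = decode(x, w, feasible)
```

In realistic problems the feasible set is exponentially large, which is why the enumeration above does not scale.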

We also introduce the distortion function $d : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$. The value $d(y, y')$ measures the amount of difference between two structured outputs $y$ and $y'$. Disregarding the computational and statistical aspects, the ultimate goal is to set the parameter $w$ in order to minimize the decoder distortion. That is:

$$\min_{w \in \mathcal{W}} \mathbb{E}_{(x,y) \sim D}\left[ d(y, f_w(x)) \right] \tag{2}$$

Computationally speaking, the above procedure is inefficient since $d(y, f_w(x))$ is a discontinuous function with respect to $w$ and thus, it is in general an exponential-time optimization problem. Statistically speaking, the problem in eq.(2) requires access to the data distribution $D$ and thus, in general it would require an infinite amount of data. In practice, we only have access to a small amount of training data.

Additionally, eq.(2) would potentially favor parameters $w$ with low distortion, but that could be in a neighborhood of parameters with high distortion. In order to avoid this issue, we could optimize a more "robust" objective under Gaussian perturbations. More formally, let $\alpha > 0$ and let $Q(w)$ be a unit-variance Gaussian distribution centered at $\alpha w$ of parameters $w' \in \mathcal{W}$. The Gibbs decoder distortion of the perturbation distribution $Q(w)$ and data distribution $D$ is defined as:

$$L(Q(w), D) = \mathbb{E}_{(x,y) \sim D}\left[ \mathbb{E}_{w' \sim Q(w)}\left[ d(y, f_{w'}(x)) \right] \right] \tag{3}$$

The minimization of the Gibbs decoder distortion can be expressed as:

$$\min_{w \in \mathcal{W}} L(Q(w), D)$$
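The Gibbs decoder distortion can be approximated by Monte Carlo sampling. The sketch below is an illustration under stated assumptions: it averages over a finite sample list in place of the data distribution, and the feature map `phi`, the distortion `distortion` and the toy problem are hypothetical choices, not the paper's.

```python
import random
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x, y):
    """Hypothetical feature map: element i contributes x[i] when i is in y."""
    return [x[i] if i in y else 0 for i in range(len(x))]

def decode(x, w, feasible):
    """Linear decoder: exhaustive argmax of phi(x, y) . w."""
    return max(feasible, key=lambda y: dot(phi(x, y), w))

def distortion(y, y_prime):
    """Normalized symmetric difference, a value in [0, 1]."""
    return len(y ^ y_prime) / (len(y) + len(y_prime))

def gibbs_distortion(samples, w, alpha, feasible, num_draws=200, seed=0):
    """Average distortion of decoders with parameters w' ~ N(alpha * w, I)."""
    rng = random.Random(seed)
    total = 0.0
    for x, y in samples:
        for _ in range(num_draws):
            w_prime = [alpha * wi + rng.gauss(0.0, 1.0) for wi in w]
            total += distortion(y, decode(x, w_prime, feasible))
    return total / (len(samples) * num_draws)

feasible = [frozenset(c) for c in combinations(range(4), 2)]
x = [1, 1, 1, 1]
w = [0.5, -1.0, 2.0, 0.1]
samples = [(x, decode(x, w, feasible))]  # label produced by the unperturbed decoder
estimate = gibbs_distortion(samples, w, alpha=1.0, feasible=feasible)
```

With a very large `alpha` the unit-variance noise becomes negligible relative to the mean, so the estimate approaches the distortion of the unperturbed decoder.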

The focus of our analysis will be to propose upper bounds 
of the Gibbs decoder distortion, with good computational 
and statistical properties. That is, we will propose upper 
bounds that can be computed in polynomial-time, and 
that require a small amount of training data. 

For our analysis, we follow the same set of assumptions as in [12]. We define the margin $m(x, y, y', w)$ as the amount by which $y$ is preferable to $y'$ under the parameter $w$. More formally:

$$m(x, y, y', w) = \phi(x, y) \cdot w - \phi(x, y') \cdot w$$

Let $c(p, x, y)$ be a nonnegative integer that gives the number of times that the part $p \in \mathcal{P}$ appears in the pair $(x, y)$. For a part $p \in \mathcal{P}$, we define the feature $p$ as follows:

$$\phi_p(x, y) = c(p, x, y)$$

We let $\mathcal{P}(x) \ne \emptyset$ denote the set of $p \in \mathcal{P}$ such that there exists $y \in \mathcal{Y}(x)$ with $c(p, x, y) > 0$. We define the Hamming distance $H$ as follows:

$$H(x, y, y') = \sum_{p \in \mathcal{P}(x)} |c(p, x, y) - c(p, x, y')|$$
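The margin and the Hamming distance defined above can be sketched as follows. Because each feature equals a part count, both functions take count vectors directly; the vectors and the parameter below are hypothetical toy values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin(phi_y, phi_y_prime, w):
    """m(x, y, y', w) = phi(x, y) . w - phi(x, y') . w"""
    return dot(phi_y, w) - dot(phi_y_prime, w)

def hamming(counts_y, counts_y_prime):
    """H(x, y, y') = sum over parts of |c(p, x, y) - c(p, x, y')|"""
    return sum(abs(a - b) for a, b in zip(counts_y, counts_y_prime))

phi_y = [1, 0, 2]        # part counts of y
phi_y_prime = [0, 1, 2]  # part counts of y'
w = [1.0, -1.0, 0.5]
m = margin(phi_y, phi_y_prime, w)
h = hamming(phi_y, phi_y_prime)
```

In this toy case the margin equals the Hamming distance, so the difference between the two is exactly zero.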


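The commonly applied margin-based approach to learning $w$ uses the maximum loss over all possible structured outputs [1, 4, 18]. That is:^1

$$\min_{w \in \mathcal{W}} \frac{1}{n} \sum_{(x,y) \in S} \max_{\hat{y} \in \mathcal{Y}(x)} d(y, \hat{y}) \, 1\left( H(x, y, \hat{y}) - m(x, y, \hat{y}, w) \ge 0 \right) + \lambda \|w\|_2^2 \tag{4}$$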


In Section 2 we reproduce the results in [12] and show that the above objective is related to an upper bound of the Gibbs decoder distortion in eq.(3). Note that evaluating the objective function in eq.(4) is as hard as the inference problem in eq.(1), since both perform maximization over the set $\mathcal{Y}(x)$.

Our main contributions are presented in Sections 3 and 4. Inspired by recent work in natural language processing [20, 21], we show a tighter upper bound of the Gibbs decoder distortion in eq.(3), which is related to the following objective:^1

$$\min_{w \in \mathcal{W}} \frac{1}{n} \sum_{(x,y) \in S} \max_{\hat{y} \in T(w,x)} d(y, \hat{y}) \, 1\left( H(x, y, \hat{y}) - m(x, y, \hat{y}, w) \ge 0 \right) + \lambda \|w\|_2^2 \tag{5}$$


where $T(w, x)$ is a set of random structured outputs sampled i.i.d. from some proposal distribution with support on $\mathcal{Y}(x)$. Note that evaluating the objective function in eq.(5) is linear-time in the number of random structured outputs in $T(w, x)$.

^1 For computational convenience, the convex hinge loss $\max(0, 1 + z)$ is used in practice instead of the discontinuous 0/1 loss $1(z \ge 0)$.
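The two objectives differ only in the candidate set over which the maximum is taken. The sketch below evaluates such an objective for an arbitrary candidate generator, with either the 0/1 loss or the hinge loss. It is an illustration under simplifying assumptions: each output is represented directly by its count vector, and the names `max_loss_objective` and `candidates` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hamming(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def indicator_loss(z):
    """Discontinuous 0/1 loss 1(z >= 0)."""
    return 1.0 if z >= 0 else 0.0

def hinge_loss(z):
    """Convex surrogate max(0, 1 + z)."""
    return max(0.0, 1.0 + z)

def max_loss_objective(samples, w, lam, candidates, distortion,
                       loss=indicator_loss):
    """Average max loss over the candidate outputs plus lam * ||w||_2^2.

    Toy assumption: an output y is its own feature vector, phi(x, y) = y.
    candidates(w, x) returns all feasible outputs (as in eq. (4)) or a
    random subset of them (as in eq. (5)).
    """
    total = 0.0
    for x, y in samples:
        total += max(
            distortion(y, y_hat)
            * loss(hamming(y, y_hat) - (dot(y, w) - dot(y_hat, w)))
            for y_hat in candidates(w, x)
        )
    return total / len(samples) + lam * dot(w, w)

binary = lambda a, b: float(a != b)
samples = [(None, (1, 0))]
both = lambda w, x: [(1, 0), (0, 1)]
value_01 = max_loss_objective(samples, [1.0, 0.0], 0.0, both, binary)
value_hinge = max_loss_objective(samples, [1.0, 0.0], 0.0, both, binary,
                                 loss=hinge_loss)
```

The cost of one evaluation is proportional to the number of candidates returned per sample, which is what makes a small random candidate set attractive.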


2 From PAC-Bayes to the Maximum Loss Over All Possible Structured Outputs

In this section, we show the relationship between PAC-Bayes bounds and the commonly used maximum loss over all possible structured outputs.

As reported in [12], by using the PAC-Bayes framework under Gaussian perturbations, we show that the commonly used maximum loss over all possible structured outputs is an upper bound of the Gibbs decoder distortion up to statistical accuracy ($\mathcal{O}(\sqrt{\log n / n})$ for $n$ training samples).

Theorem 1 ([12]). Assume that there exists a finite integer value $\ell$ such that $|\cup_{(x,y) \in S} \mathcal{P}(x)| \le \ell$. Fix $\delta \in (0, 1)$. With probability at least $1 - \delta/2$ over the choice of $n$ training samples, simultaneously for all parameters $w \in \mathcal{W}$ and unit-variance Gaussian perturbation distributions $Q(w)$ centered at $w \sqrt{2 \log(2 n \ell / \|w\|_2^2)}$, we have:

$$L(Q(w), D) \le \frac{1}{n} \sum_{(x,y) \in S} \max_{\hat{y} \in \mathcal{Y}(x)} d(y, \hat{y}) \, 1\left( H(x, y, \hat{y}) - m(x, y, \hat{y}, w) \ge 0 \right) + \frac{\|w\|_2^2}{n} + \sqrt{\frac{\|w\|_2^2 \log(2 n \ell / \|w\|_2^2) + \log(2n/\delta)}{2(n-1)}}$$

(See Appendix A for detailed proofs.)

The proof of the above is based on the PAC-Bayes theorem and well-known Gaussian concentration inequalities. As is customary in generalization results, a deterministic expectation with respect to the data distribution $D$ is upper-bounded by a stochastic quantity with respect to the training set $S$. This takes into account the statistical aspects of the problem.

Note that the upper bound uses maximization with respect to $\mathcal{Y}(x)$ and that in general, $|\mathcal{Y}(x)|$ is exponential with respect to the input size. Thus, the computational aspects of the problem have not been fully addressed yet. In the next section, we solve this issue by introducing randomness.
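To get a feel for the statistical accuracy of Theorem 1, the sketch below evaluates the part of the bound that is added to the empirical maximum loss. It assumes the bound reads as the empirical term plus $\|w\|_2^2/n$ plus a square-root term with denominator $2(n-1)$; the function name and the numeric inputs are hypothetical.

```python
import math

def theorem1_slack(w_norm_sq, n, ell, delta):
    """Terms added to the empirical maximum loss, under the assumed form:
    ||w||^2 / n + sqrt((||w||^2 log(2 n ell / ||w||^2) + log(2n/delta))
                       / (2 (n - 1)))
    """
    complexity = w_norm_sq * math.log(2.0 * n * ell / w_norm_sq)
    confidence = math.log(2.0 * n / delta)
    return w_norm_sq / n + math.sqrt((complexity + confidence)
                                     / (2.0 * (n - 1)))

slack_small = theorem1_slack(w_norm_sq=1.0, n=100, ell=10, delta=0.05)
slack_large = theorem1_slack(w_norm_sq=1.0, n=10000, ell=10, delta=0.05)
```

The slack shrinks as the number of training samples grows, in line with the stated rate in $n$.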


3 From PAC-Bayes to the Maximum Loss Over Random Structured Outputs

In this section, we analyze the relationship between PAC-Bayes bounds and the maximum loss over random structured outputs sampled independently from some proposal distribution.

First, we will focus on the computational aspects. Instead of using maximization with respect to $\mathcal{Y}(x)$, we will perform maximization with respect to a set $T(w, x)$ of random structured outputs sampled i.i.d. from some proposal distribution $R(w, x)$ with support on $\mathcal{Y}(x)$. In order for this approach to be computationally appealing, $|T(w, x)|$ should be polynomial, even when $|\mathcal{Y}(x)|$ is exponential with respect to the input size.

Assumptions A and B will allow us to attain $|T(w, x)| = \mathcal{O}\left( \max\left( \frac{1}{\log(1/\beta)}, \|w\|_2^2 \right) \right)$. The constant $\beta \in [0, 1)$ is properly introduced in Assumption A. It can be easily observed that $\beta$ plays an important role in the number of random structured outputs that we need to draw from the proposal distribution $R(w, x)$. Next, we present our first assumption.

Assumption A (Maximal distortion). The proposal distribution $R(w, x)$ fulfills the following condition. There exists a value $\beta \in [0, 1)$ such that for all $(x, y) \in S$ and $w \in \mathcal{W}$:

$$\mathbb{P}_{y' \sim R(w,x)}\left[ d(y, y') = 1 \right] \ge 1 - \beta$$

In Section 4 we show examples that fulfill the above assumption, which include a binary distortion function for any type of structured output, as well as a distortion function that returns the number of different edges/elements for directed spanning trees, directed acyclic graphs and cardinality-constrained sets.

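Next, we present our second assumption that allows obtaining $|T(w, x)| = \mathcal{O}\left( \max\left( \frac{1}{\log(1/\beta)}, \|w\|_2^2 \right) \right)$. While Assumption A contributes with the term $\frac{1}{\log(1/\beta)}$ in $|T(w, x)|$, the following assumption contributes with the term $\|w\|_2^2$ in $|T(w, x)|$.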

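Assumption B (Low norm). For any vector $z \in \mathbb{R}^k$, define:

$$\mu(z) = \begin{cases} z / \|z\|_1 & \text{if } z \ne 0 \\ 0 & \text{if } z = 0 \end{cases}$$

The proposal distribution $R(w, x)$ fulfills the following condition for all $(x, y) \in S$ and $w \in \mathcal{W}$:^2

$$\left\| \mathbb{E}_{y' \sim R(w,x)}\left[ \mu(\phi(x, y) - \phi(x, y')) \right] \right\|_2 \le \frac{1}{2\sqrt{n}} \le \frac{1}{2 \|w\|_2}$$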

It is natural to ask whether there are instances that fulfill the above assumption. In Section 4 we provide two extreme cases: one example of a sparse mapping and a uniform proposal, and one example of a dense mapping and an arbitrary proposal distribution.

We will now focus on the statistical aspects. Note that randomness does not only stem from data, but also from sampling structured outputs. That is, in Theorem 1 randomness only stems from the training set $S$. We now need to produce generalization results that hold for all the sets $T(w, x)$ of random structured outputs. In addition, the uniform convergence of Theorem 1 holds for all parameters $w$. We now need to produce a generalization result that also holds for all possible proposal distributions $R(w, x)$. Therefore, we need a method for upper-bounding the number of possible proposal distributions $R(w, x)$. Assumption C will allow us to upper-bound this number.

^2 The second inequality follows from an implicit assumption made in Theorem 1, i.e., $\|w\|_2^2 / n \le 1$. Note that if $\|w\|_2^2 / n > 1$ then Theorem 1 provides an upper bound greater than 1, which is meaningless since the distortion function $d$ is at most 1.

Assumption C (Linearly inducible ordering). The proposal distribution $R(w, x)$ depends solely on the linear ordering induced by the parameter $w \in \mathcal{W}$ and the mapping $\phi(x, \cdot)$. More formally, let $r(x) = |\mathcal{Y}(x)|$ and thus $\mathcal{Y}(x) = \{y_1 \dots y_{r(x)}\}$. Let $w, w' \in \mathcal{W}$ be any two arbitrary parameters. Let $\pi(x) = (\pi_1 \dots \pi_{r(x)})$ be a permutation of $\{1 \dots r(x)\}$ such that $\phi(x, y_{\pi_1}) \cdot w < \dots < \phi(x, y_{\pi_{r(x)}}) \cdot w$. Let $\pi'(x) = (\pi'_1 \dots \pi'_{r(x)})$ be a permutation of $\{1 \dots r(x)\}$ such that $\phi(x, y_{\pi'_1}) \cdot w' < \dots < \phi(x, y_{\pi'_{r(x)}}) \cdot w'$. For all $w, w' \in \mathcal{W}$ and $x \in \mathcal{X}$, if $\pi(x) = \pi'(x)$ then $KL(R(w, x) \| R(w', x)) = 0$. In this case, we say that the proposal distribution fulfills $R(\pi(x), x) \equiv R(w, x)$.

Assumption C states that two proposal distributions $R(w, x)$ and $R(w', x)$ are the same provided that for the same permutation $\pi(x)$ we have $\phi(x, y_{\pi_1}) \cdot w < \dots < \phi(x, y_{\pi_{r(x)}}) \cdot w$ and $\phi(x, y_{\pi_1}) \cdot w' < \dots < \phi(x, y_{\pi_{r(x)}}) \cdot w'$. Geometrically speaking, for a fixed $x$ we first project the feature vectors $\phi(x, y)$ of all the structured outputs $y \in \mathcal{Y}(x)$ onto the lines $w$ and $w'$. Let $\pi(x)$ and $\pi'(x)$ be the resulting ordering of the structured outputs after projecting them onto $w$ and $w'$ respectively. Two proposal distributions $R(w, x)$ and $R(w', x)$ are the same provided that $\pi(x) = \pi'(x)$. That is, the specific values of $\phi(x, y) \cdot w$ and $\phi(x, y) \cdot w'$ are irrelevant, and only their ordering matters.
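The ordering induced by a parameter can be computed by sorting the projected scores. The sketch below is a toy illustration with hypothetical feature vectors; it assumes no ties among the scores, matching the strict inequalities in Assumption C.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def induced_ordering(features, w):
    """Indices of the outputs sorted by increasing score phi(x, y_i) . w.

    Assumes the scores are pairwise distinct.
    """
    scores = [dot(f, w) for f in features]
    return tuple(sorted(range(len(features)), key=lambda i: scores[i]))

# Feature vectors phi(x, y_1), phi(x, y_2), phi(x, y_3) for a fixed x.
features = [(1, 0), (0, 1), (1, 1)]
same_a = induced_ordering(features, (1.0, 2.0))
same_b = induced_ordering(features, (10.0, 20.0))  # rescaled parameter
other = induced_ordering(features, (2.0, 1.0))
```

Rescaling the parameter changes every score but not the ordering, so a proposal that fulfills Assumption C would treat the first two parameters identically.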

In Section 4 we show examples that fulfill the above assumption, which include the algorithm proposed in [20, 21] for directed spanning trees, and our proposed generalization to any type of data structure with computationally efficient local changes.

In what follows, by using the PAC-Bayes framework under Gaussian perturbations, we show that the maximum loss over random structured outputs sampled independently from some proposal distribution provides an upper bound of the Gibbs decoder distortion up to statistical accuracy ($\mathcal{O}(\log^{3/2} n / \sqrt{n})$ for $n$ training samples).

Theorem 2. Assume that there exist finite integer values $\ell$ and $r$ such that $|\cup_{(x,y) \in S} \mathcal{P}(x)| \le \ell$ and $|\mathcal{Y}(x)| \le r$ for all $(x, y) \in S$. Assume that the proposal distribution $R(w, x)$ with support on $\mathcal{Y}(x)$ fulfills Assumption A with value $\beta$, as well as Assumptions B and C. Fix $\delta \in (0, 1)$ and an integer $s$ such that $3 \le s \le \ldots + 1$. With probability at least $1 - \delta$ over the choice of both $n$ training samples and $n$ sets of random structured outputs, simultaneously for all parameters $w \in \mathcal{W}$ with $\|w\|_0 \le s$, unit-variance Gaussian perturbation distributions $Q(w)$ centered at $w \sqrt{2 \log(2 n \ell / \|w\|_2^2)}$, and for sets of random structured outputs $T(w, x)$ sampled i.i.d. from the proposal distribution $R(w, x)$ for each training sample $(x, y) \in S$, such that

$$|T(w, x)| = \left\lceil \frac{1}{2} \max\left( \frac{1}{\log(1/\beta)}, 32 \|w\|_2^2 \right) \log n \right\rceil$$

we have:

$$L(Q(w), D) \le \frac{1}{n} \sum_{(x,y) \in S} \max_{\hat{y} \in T(w,x)} d(y, \hat{y}) \, 1\left( H(x, y, \hat{y}) - m(x, y, \hat{y}, w) \ge 0 \right) + \frac{\|w\|_2^2}{n} + \sqrt{\frac{\|w\|_2^2 \log(2 n \ell / \|w\|_2^2) + \log(2n/\delta)}{2(n-1)}} + \sqrt{\frac{1}{n}} + \max\left( \frac{1}{\log(1/\beta)}, 32 \|w\|_2^2 \right) \sqrt{\frac{s \log(\ell + 1) \log^3(n + 1)}{n}} + 3 \sqrt{\frac{s (\log \ell + 2 \log(nr)) + \log(4/\delta)}{n}}$$

(See Appendix A for detailed proofs.)
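The number of random structured outputs per training sample can be computed directly. The sketch below assumes the set size reads as the ceiling of one half of $\max(1/\log(1/\beta), 32\|w\|_2^2)$ times $\log n$; the function name is hypothetical, and the case $\beta = 0$ is handled by treating its term as zero.

```python
import math

def num_random_outputs(beta, w_norm_sq, n):
    """ceil(0.5 * max(1 / log(1/beta), 32 ||w||^2) * log n), assumed form."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    first = 0.0 if beta == 0.0 else 1.0 / math.log(1.0 / beta)
    return math.ceil(0.5 * max(first, 32.0 * w_norm_sq) * math.log(n))

size_small_norm = num_random_outputs(beta=0.8, w_norm_sq=0.01, n=100)
size_large_norm = num_random_outputs(beta=0.5, w_norm_sq=1.0, n=100)
```

The size grows only logarithmically in the number of training samples, so evaluating the random maximum loss stays cheap even when the feasible set is exponentially large.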

The proof of the above is based on Theorem 1 as a starting point. In order to account for the computational aspect of requiring sets $T(w, x)$ of polynomial size, we use Assumptions A and B for bounding a deterministic expectation. In order to account for the statistical aspects, we use Assumption C and Rademacher complexity arguments for bounding a stochastic quantity for all sets $T(w, x)$ of random structured outputs and all possible proposal distributions $R(w, x)$. The assumption of sparsity (i.e., $\|w\|_0 \le s$) is pivotal for obtaining terms of order $\mathcal{O}(\sqrt{s \log \ell / n})$. Without sparsity, the terms would be of order $\mathcal{O}(\sqrt{\ell / n})$, which is not suited for high-dimensional settings.


Inference on Test Data. Note that the upper bound in Theorem 2 holds simultaneously for all parameters $w \in \mathcal{W}$. Therefore, our result implies that after learning the optimal parameter $\hat{w} \in \mathcal{W}$ in eq.(5) from training data, we can bound the decoder distortion when performing exact inference on test data. More formally, Theorem 2 can be additionally invoked for a test set $S'$, also with probability at least $1 - \delta$. Thus, under the same setting as of Theorem 2, the Gibbs decoder distortion is upper-bounded with probability at least $1 - 2\delta$ over the choice of $S$ and $S'$. In this paper, we focus on learning the parameter of structured prediction models. We leave the analysis of approximate inference on test data for future work.


4 Examples

In this section, we provide several examples that fulfill the three main assumptions of our theoretical result.

4.1 Examples for the Maximal Distortion Assumption

In what follows, we present some examples that fulfill our Assumption A. For a binary distortion function, we show that any type of structured output fulfills the above assumption. For a distortion function that returns the number of different edges/elements, we show that directed spanning trees, directed acyclic graphs and cardinality-constrained sets fulfill the assumption as well.

For simplicity of analysis, most proofs in this part will assume a uniform proposal distribution $R(w, x) = R(x)$ with support on $\mathcal{Y}(x)$. In the following claim, we argue that we can perform a change of measure between different proposal distributions, thus allowing us to focus on uniform proposals afterwards.

Claim i (Change of measure). Let $R(w, x)$ and $R'(w, x)$ be two proposal distributions, both with support on $\mathcal{Y}(x)$. Assume that the proposal distribution $R(w, x)$ fulfills Assumption A with value $\beta_1$. Let $r_{w,x}(\cdot)$ and $r'_{w,x}(\cdot)$ be the probability mass functions of $R(w, x)$ and $R'(w, x)$ respectively. Assume that the total variation distance between $R(w, x)$ and $R'(w, x)$ is bounded as follows for all $(x, y) \in S$ and $w \in \mathcal{W}$:

$$TV(R(w, x) \| R'(w, x)) = \frac{1}{2} \sum_{y \in \mathcal{Y}(x)} |r_{w,x}(y) - r'_{w,x}(y)| \le \beta_2$$

The proposal distribution $R'(w, x)$ fulfills Assumption A with $\beta = \beta_1 + \beta_2$ provided that $\beta_1 + \beta_2 \in [0, 1)$.

Next, we provide a result for any type of structured output, but for a binary distortion function.

Claim ii (Any type of structured output). Let $\mathcal{Y}(x)$ be an arbitrary countable set of feasible decodings of $x$, such that $|\mathcal{Y}(x)| \ge 2$ for all $(x, y) \in S$. Let $d(y, y') = 1(y \ne y')$. The uniform proposal distribution $R(w, x) = R(x)$ with support on $\mathcal{Y}(x)$ fulfills Assumption A with $\beta = 1/2$.

The following claim pertains to directed spanning trees and to a distortion function that returns the number of different edges.

Claim iii (Directed spanning trees). Let $\mathcal{Y}(x)$ be the set of directed spanning trees of $v$ nodes. Let $A(y)$ be the adjacency matrix of $y \in \mathcal{Y}(x)$. Let $d(y, y') = \frac{1}{2(v-1)} \sum_{ij} |A(y)_{ij} - A(y')_{ij}|$. The uniform proposal distribution $R(w, x) = R(x)$ with support on $\mathcal{Y}(x)$ fulfills Assumption A with $\beta = \ldots$

The next result is for directed acyclic graphs and for a distortion function that returns the number of different edges.

Claim iv (Directed acyclic graphs). Let $\mathcal{Y}(x)$ be the set of directed acyclic graphs of $v$ nodes and $b$ parents per node, such that $2 \le b \le v - 2$. Let $A(y)$ be the adjacency matrix of $y \in \mathcal{Y}(x)$. Let $d(y, y') = \frac{1}{b(2v - b - 1)} \sum_{ij} |A(y)_{ij} - A(y')_{ij}|$. The uniform proposal distribution $R(w, x) = R(x)$ with support on $\mathcal{Y}(x)$ fulfills Assumption A with $\beta = \ldots$

The final example is for cardinality-constrained sets and for a distortion function that returns the number of different elements.

Claim v (Cardinality-constrained sets). Let $\mathcal{Y}(x)$ be the set of sets of $b$ elements chosen from $v$ possible elements, such that $b \le v/2$. Let $d(y, y') = \frac{1}{2b} (|y - y'| + |y' - y|)$. The uniform proposal distribution $R(w, x) = R(x)$ with support on $\mathcal{Y}(x)$ fulfills Assumption A with $\beta = 1/2$.
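The claim for a binary distortion function and a uniform proposal can be checked by direct counting. The sketch below is a toy illustration with hypothetical names; it computes the exact probability that a uniformly drawn output attains maximal distortion.

```python
from fractions import Fraction

def prob_maximal_distortion(y, feasible, d):
    """Exact P[d(y, y') = 1] when y' is uniform over the feasible set."""
    hits = sum(1 for y_prime in feasible if d(y, y_prime) == 1)
    return Fraction(hits, len(feasible))

def binary_distortion(y, y_prime):
    """d(y, y') = 1(y != y')"""
    return 1 if y != y_prime else 0

three = ["a", "b", "c"]
two = ["a", "b"]
p_three = prob_maximal_distortion("a", three, binary_distortion)
p_two = prob_maximal_distortion("a", two, binary_distortion)
```

With at least two feasible outputs the probability is at least one half, so the condition of Assumption A holds with $\beta = 1/2$ in this setting.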

4.2 Examples for the Low Norm Assumption

Next, we present some examples that fulfill our Assumption B. We provide two extreme cases: one example for sparse mappings, and one example for dense mappings.

First, we provide a result for a particular instance of a sparse mapping and a uniform proposal distribution.

Claim vi (Sparse mapping). Let $b > 0$ be an arbitrary integer value. For all $(x, y) \in S$, let $\mathcal{Y}(x) = \cup_{p \in \mathcal{P}(x)} \mathcal{Y}_p(x)$, where the partition $\mathcal{Y}_p(x)$ is defined as follows:

$$(\forall p \in \mathcal{P}(x)) \quad \mathcal{Y}_p(x) = \{ y' \mid |\phi_p(x, y) - \phi_p(x, y')| = b \;\wedge\; (\forall q \ne p)\ \phi_q(x, y) = \phi_q(x, y') \}$$

If $n \le |\mathcal{P}(x)|/4$ for all $(x, y) \in S$, then the uniform proposal distribution $R(w, x) = R(x)$ with support on $\mathcal{Y}(x)$ fulfills Assumption B.

The following claim pertains to a particular instance of a dense mapping and an arbitrary proposal distribution.

Claim vii (Dense mapping). Let $b > 0$ be an arbitrary integer value. Let $|\phi_p(x, y) - \phi_p(x, y')| = b$ for all $(x, y) \in S$, $y' \in \mathcal{Y}(x)$ and $p \in \mathcal{P}(x)$. If $n \le |\mathcal{P}(x)|/4$ for all $(x, y) \in S$, then any arbitrary proposal distribution $R(w, x)$ fulfills Assumption B.

4.3 Examples for the Linearly Inducible Ordering Assumption

In what follows, we present some examples that fulfill our Assumption C. We show that the algorithm proposed in [20, 21] for directed spanning trees fulfills the above assumption. We also generalize the algorithm in [20, 21] to any type of data structure with computationally efficient local changes, and show that this generalization fulfills the assumption as well.

Next, we present the algorithm proposed in [20, 21] for dependency parsing in natural language processing. Here, $x$ is a sentence of $v$ words and $\mathcal{Y}(x)$ is the set of directed spanning trees of $v$ nodes.


Algorithm 1 Procedure for sampling a directed spanning tree $y' \in \mathcal{Y}(x)$ from a greedy local proposal distribution $R(w, x)$

Input: parameter $w \in \mathcal{W}$, sentence $x \in \mathcal{X}$
Draw uniformly at random a directed spanning tree $\hat{y} \in \mathcal{Y}(x)$
repeat
    $s \leftarrow$ post-order traversal of $\hat{y}$
    for each node $t$ in the list $s$ do
        for each node $u$ before $t$ in the list $s$ do
            $y \leftarrow$ change the parent of node $t$ to $u$ in $\hat{y}$
            if $\phi(x, y) \cdot w > \phi(x, \hat{y}) \cdot w$ then
                $\hat{y} \leftarrow y$
            end if
        end for
    end for
until no refinement in last iteration
Output: directed spanning tree $y' \leftarrow \hat{y}$


The above algorithm has the following property:

Claim viii (Sampling for directed spanning trees). Algorithm 1 fulfills Assumption C.

Note that Algorithm 1 proposed in [20, 21] uses the fact that we can perform local changes to a directed spanning tree in a computationally efficient manner. That is, changing parents of nodes in a post-order traversal will produce directed spanning trees. We can extend the above algorithm to any type of data structure where we can perform computationally efficient local changes. For instance, we can easily extend the method for directed acyclic graphs (traversed in post-order as well) and for sets with up to some prespecified number of elements.

Next, we generalize Algorithm 1 to any type of structured output.


Algorithm 2 Procedure for sampling a structured output $y' \in \mathcal{Y}(x)$ from a greedy local proposal distribution $R(w, x)$

Input: parameter $w \in \mathcal{W}$, observed input $x \in \mathcal{X}$
Draw uniformly at random a structured output $\hat{y} \in \mathcal{Y}(x)$
repeat
    Make a local change to $\hat{y}$ in order to increase $\phi(x, \hat{y}) \cdot w$
until no refinement in last iteration
Output: structured output $y' \leftarrow \hat{y}$
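One possible instance of Algorithm 2, for sets of a fixed number of elements, is sketched below. The choice of local change (swapping one element in for one element out) and the additive toy score are assumptions made for illustration; the paper does not prescribe them.

```python
import random

def greedy_sample(score, num_elements, size, rng):
    """Greedy local search from a uniformly random set of the given size.

    A local change swaps one element of the set for one element outside
    it, and is accepted only if it strictly increases the score.
    """
    y = set(rng.sample(range(num_elements), size))
    improved = True
    while improved:
        improved = False
        for out in sorted(y):
            for inn in range(num_elements):
                if inn in y:
                    continue
                candidate = (y - {out}) | {inn}
                if score(candidate) > score(y):
                    y = candidate
                    improved = True
                    break  # 'out' has left the set; move to the next element
    return frozenset(y)

# Toy score phi(x, y) . w with one indicator feature per element.
w = [0.3, 2.0, -1.0, 0.7, 1.5, 0.1]
score = lambda y: sum(w[i] for i in y)
sampled = greedy_sample(score, num_elements=6, size=2, rng=random.Random(0))
```

The procedure only ever compares scores, so for a fixed random start its output depends on the parameter solely through the ordering of the scores, which is the property that Assumption C asks for. With this additive toy score and distinct weights, every local optimum is also the global optimum.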


The above algorithm has the following property:

Claim ix (Sampling for any type of structured output). Algorithm 2 fulfills Assumption C.

5 Experimental Results 

In this section, we provide experimental evidence on synthetic data. Note that the work of [20, 21] has provided extensive experimental evidence on real-world datasets, for part-of-speech tagging and dependency parsing in the context of natural language processing. Our experimental results are not only for directed spanning trees [20, 21] but also for directed acyclic graphs and cardinality-constrained sets.

Table 1: Average over 30 repetitions, and standard error at 95% confidence level of several methods and measurements. For the maximum loss over all possible structured outputs (All) we used eq.(4) for training, and eq.(1) for inference on a test set. For the maximum loss over random structured outputs (Random and Random/All) we used eq.(5) for training. For inference, Random used eq.(6) while Random/All used eq.(1). Random outperforms All in the different study cases (directed spanning trees, directed acyclic graphs and cardinality-constrained sets). The difference between Random and Random/All is not statistically significant.

Problem | Method | Training runtime | Training distortion | Test runtime | Test distortion | Distance to ground truth | Angle with ground truth
Directed spanning trees | All | 1000 | 52% ± 1.1% | 12.4 ± 0.4 | 61% ± 1.8% | 0.56 ± 0.004 | 74° ± 0.3°
Directed spanning trees | Random | 104 ± 3 | 38% ± 2.1% | 2.4 ± 0.1 | 56% ± 1.9% | 0.51 ± 0.005 | 49° ± 0.6°
Directed spanning trees | Random/All | (as Random) | (as Random) | 12.4 ± 0.3 | 56% ± 1.9% | (as Random) | (as Random)
Directed acyclic graphs | All | 1000 | 41% ± 1.2% | 10.8 ± 0.2 | 45% ± 1.5% | 0.60 ± 0.020 | 61° ± 1.0°
Directed acyclic graphs | Random | 386 ± 21 | 30% ± 1.3% | 8.5 ± 0.2 | 39% ± 1.6% | 0.40 ± 0.008 | ? ± 1.0°
Directed acyclic graphs | Random/All | (as Random) | (as Random) | 10.8 ± 0.2 | 39% ± 1.6% | (as Random) | (as Random)
Cardinality-constrained sets | All | 1000 | 42% ± 1.4% | 11.1 ± 0.4 | 45% ± 1.8% | 0.58 ± 0.011 | 65° ± 0.6°
Cardinality-constrained sets | Random | 272 ± 9 | 21% ± 1.2% | 6.0 ± 0.2 | 30% ± 1.9% | 0.44 ± 0.008 | ? ± 0.8°
Cardinality-constrained sets | Random/All | (as Random) | (as Random) | 10.9 ± 0.3 | 29% ± 2.1% | (as Random) | (as Random)

We performed 30 repetitions of the following procedure. We generated a ground truth parameter $w^*$ with independent zero-mean and unit-variance Gaussian entries. Then, we generated a training set $S$ of $n = 100$ samples. The fixed mapping $\phi$ from pairs $(x, y)$ to feature vectors $\phi(x, y)$ is as follows. For every pair of possible edges/elements $i$ and $j$, we define $\phi_{ij}(x, y) = 1(x_{ij} = 1 \wedge i \in y \wedge j \in y)$. For instance, for
directed spanning trees of v nodes, we have x G {0,1}(2) 
and (j){x,y) In order to generate each training 

sample $(x, y) \in S$, we generated a random vector $x$ with independent Bernoulli entries, each with equal probability of being 1 or 0. After generating $x$, we set $y = f_{w^*}(x)$. That is, we solved eq.(1) in order to produce the latent structured output $y$ from the observed input $x$ and the parameter $w^*$.

We compared two training methods: the maximum loss over all possible structured outputs as in eq.(4), and the maximum loss over random structured outputs as in eq.(5). For both minimization problems, we replaced the discontinuous 0/1 loss $1(z \ge 0)$ with the convex hinge loss $\max(0, 1 + z)$, as is customary. For both problems, we used $\lambda = 1/n$ as suggested by Theorems 1 and 2, and we performed 20 iterations of the subgradient descent method with a decaying step size $1/\sqrt{t}$ for iteration $t$. For sampling random structured outputs in eq.(5), we implemented Algorithm 2 for directed spanning trees, directed acyclic graphs and cardinality-constrained sets. We considered directed spanning trees of 6 nodes, directed acyclic graphs of 5 nodes and 2 parents per node, and sets of 4 elements chosen from 15 possible elements. We used $\beta = 0.8$ for directed spanning trees, $\beta = 0.85$ for directed acyclic graphs, and $\beta = 0.5$ for cardinality-constrained sets, as prescribed by Claims iii, iv and v.
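A single subgradient step for the hinge-loss version of the objective can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: each output is represented directly by its feature vector, the candidate set is treated as fixed within the step, and all names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hamming(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def subgradient_step(w, samples, candidates, distortion, lam, t):
    """One step w <- w - (1/sqrt(t)) * g for the hinge objective.

    For each sample, the candidate with the largest weighted hinge loss
    contributes d(y, y_hat) * (phi(y_hat) - phi(y)) / n to g; the
    regularizer lam * ||w||^2 contributes 2 * lam * w.
    """
    n = len(samples)
    grad = [2.0 * lam * wi for wi in w]
    for x, y in samples:
        best_val, best_dir = 0.0, None
        for y_hat in candidates(w, x):
            z = hamming(y, y_hat) - (dot(y, w) - dot(y_hat, w))
            val = distortion(y, y_hat) * max(0.0, 1.0 + z)
            if val > best_val:
                best_val = val
                best_dir = [distortion(y, y_hat) * (b - a)
                            for a, b in zip(y, y_hat)]
        if best_dir is not None:
            grad = [g + bd / n for g, bd in zip(grad, best_dir)]
    step = 1.0 / math.sqrt(t)
    return [wi - step * g for wi, g in zip(w, grad)]

binary = lambda a, b: float(a != b)
samples = [(None, (1, 0))]
both = lambda w, x: [(1, 0), (0, 1)]
w_next = subgradient_step([0.0, 0.0], samples, both, binary, lam=0.0, t=1)
```

In the toy run, the step moves the parameter toward the observed output and away from the competing one, which increases the margin between them.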
After training, for inference on an independent test set, we used eq.(1) for the maximum loss over all possible structured outputs. For the maximum loss over random structured outputs, we use the following approximate inference approach:

$$\tilde{f}_w(x) = \operatorname*{argmax}_{y \in T(w,x)} \phi(x, y) \cdot w \tag{6}$$
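The approximate inference rule restricts the argmax to a sampled set. The sketch below uses a uniform sampler as a hypothetical stand-in for the proposal distribution; the paper samples with Algorithm 2 instead, and the score and feasible set are toy choices.

```python
import random
from itertools import combinations

def approximate_decode(score, sampler, num_samples):
    """Argmax of the score over num_samples outputs drawn by the sampler."""
    candidates = [sampler() for _ in range(num_samples)]
    return max(candidates, key=score)

w = [0.5, -1.0, 2.0, 0.1]
score = lambda y: sum(w[i] for i in y)  # toy phi(x, y) . w
feasible = [frozenset(c) for c in combinations(range(4), 2)]

rng = random.Random(0)
uniform_sampler = lambda: rng.choice(feasible)  # stand-in proposal
approx = approximate_decode(score, uniform_sampler, num_samples=3)
exact = max(feasible, key=score)
```

The approximate decoder can never score higher than exact inference, and its cost is proportional to the number of sampled outputs rather than to the size of the feasible set.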

Table 1 shows the average over 30 repetitions, and the standard error at 95% confidence level of the following measurements. We report the runtime, the training distortion as well as the test distortion in an independently generated set of 100 samples. We also report the normalized distance of the learnt $\hat{w}$ to the ground truth $w^*$, i.e., $\|\hat{w} - w^*\|_2 / \sqrt{\ell}$. Additionally, we report the angle of the learnt $\hat{w}$ with respect to the ground truth $w^*$, i.e., $\arccos(\hat{w} \cdot w^* / (\|\hat{w}\|_2 \|w^*\|_2))$. In the different study cases (directed spanning trees, directed acyclic graphs and cardinality-constrained sets), the maximum loss over random structured outputs outperforms the maximum loss over all possible structured outputs.
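The two parameter-recovery measurements can be computed as in the sketch below. The normalizer of the distance is passed in as `dim` because its exact symbol is an assumption here; the cosine is clamped to guard against rounding just outside the valid range.

```python
import math

def normalized_distance(w, w_star, dim):
    """Euclidean distance between w and w_star divided by sqrt(dim)."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(w, w_star))
    return math.sqrt(diff_sq) / math.sqrt(dim)

def angle_degrees(w, w_star):
    """Angle between w and w_star, in degrees."""
    inner = sum(a * b for a, b in zip(w, w_star))
    norms = (math.sqrt(sum(a * a for a in w))
             * math.sqrt(sum(b * b for b in w_star)))
    cosine = max(-1.0, min(1.0, inner / norms))
    return math.degrees(math.acos(cosine))

distance = normalized_distance((3.0, 4.0), (0.0, 0.0), dim=25)
right_angle = angle_degrees((1.0, 0.0), (0.0, 1.0))
zero_angle = angle_degrees((1.0, 0.0), (2.0, 0.0))
```

The angle is invariant to rescaling either vector, which makes it a useful complement to the distance when only the direction of the parameter matters for decoding.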

6 Discussion 

In this section, we provide more details regarding the computational complexity of the inference problem. We also present a brief review of the previous work and provide ideas for extending our theoretical result.

Computational Complexity of the Inference Problem. Very few cases of the general inference problem in eq.(1) are tractable. For instance, if $\mathcal{Y}(x)$ is the set of directed spanning trees, and $w$ is a vector of edge weights (i.e., linear with respect to $y$), then eq.(1) is equivalent to the maximum directed spanning tree problem, which is polynomial-time. In general, the inference problem in eq.(1) is not only NP-hard but also hard to approximate. For instance, if $\mathcal{Y}(x)$ is the set of directed acyclic graphs, and $w$ is a vector of edge weights (i.e., linear with respect to $y$), then eq.(1) is equivalent to the maximum acyclic subgraph problem, for which approximating within a factor better than 1/2 is unique-games hard [8]. As an additional example, consider the case where $\mathcal{Y}(x)$ is the set of sets with up to some prespecified number of elements (i.e., $\mathcal{Y}(x)$ is a cardinality constraint), and the objective $\phi(x, y) \cdot w$ is submodular with respect to $y$. In this case, eq.(1) cannot be approximated within a factor better than $1 - 1/e$ unless P=NP [14].

These negative results led us to avoid interpreting the maximum loss over random structured outputs in eq.(5) as an approximate optimization algorithm for the maximum loss over all possible structured outputs in eq.(4).

Previous Work. Approximate inference was proposed in [10], with an adaptation of the proof techniques in [12]. More specifically, [10] performs maximization of the loss over a superset of feasible decodings of $x$, i.e., over $\hat{y} \in \mathcal{Y}'(x) \supseteq \mathcal{Y}(x)$. Note that our upper bound of the Gibbs decoder distortion dominates the maximum loss over $\hat{y} \in \mathcal{Y}(x)$, and the latter dominates the upper bound of [10]. One could potentially use a similar argument with respect to a subset of feasible decodings of $x$, i.e., with respect to $\hat{y} \in \mathcal{Y}'(x) \subseteq \mathcal{Y}(x)$. Unfortunately, this approach does not obtain an upper bound of the Gibbs decoder distortion.

Tangential to our work, previous analyses have exclusively focused either on sample complexity or convergence. Sample complexity analyses include margin bounds [18], Rademacher complexity m and PAC-Bayes bounds [nids]. Convergence has been analyzed for specific algorithms for the separable [5] and nonseparable [7] cases.

Concluding Remarks. The work of [20, 21] has shown extensive experimental evidence for part-of-speech tagging and dependency parsing in the context of natural language processing. In this paper, we present a theoretical analysis that explains the experimental success of [20, 21] for directed spanning trees. Our analysis was provided for a far more general setup, which allowed proposing algorithms for other types of structured outputs, such as directed acyclic graphs and cardinality-constrained sets. We hope that our theoretical work will motivate experimental validation on many other real-world structured prediction problems.

There are several ways of extending this research.
While we focused on Gaussian perturbations, it would be
interesting to analyze other distributions from the computational
as well as statistical viewpoints. We analyzed a
general class of proposal distributions that depend on the
induced linear orderings. Algorithms that make greedy
local changes traverse the set of feasible decodings in a
constrained fashion, by following allowed moves defined
by some prespecified graph. The addition of these graph-theoretical
constraints would enable obtaining tighter upper
bounds. From a broader perspective, extensions of
our work to latent models [nds] as well as maximum
a-posteriori perturbation models [5][TC] would be of great
interest. Finally, while we focused on learning the parameter
of structured prediction models, it would be interesting
to analyze approximate inference for prediction
on an independent test set.
SUPPLEMENTARY MATERIAL.

Structured Prediction: From Gaussian Perturbations
to Linear-Time Principled Algorithms


A Detailed Proofs 

In this section, we state the proofs of all the theorems and claims in our manuscript. 


A.1 Proof of Theorem 1

Here, we provide the proof of Theorem 1. First, we derive an intermediate lemma needed for the final proof.

Lemma 1 (Adapted from Lemma 6 in [12]). Assume that there exists a finite integer value $\ell$ such that
$|\cup_{(x,y)\in S} \mathcal{P}(x)| \le \ell$. Let $Q(w)$ be a unit-variance Gaussian distribution centered at $\alpha w$ for $\alpha = \sqrt{2\log(2n\ell/\|w\|_2^2)}$.
Simultaneously for all $(x,y) \in S$, $y' \in \mathcal{Y}(x)$ and $w \in \mathcal{W}$, we have:

\[ \mathbb{P}_{w'\sim Q(w)}\left[ H(x,y',f_{w'}(x)) - m(x,y',f_{w'}(x),w) < 0 \right] \le \|w\|_2^2/n \]

or equivalently:

\[ \mathbb{P}_{w'\sim Q(w)}\left[ H(x,y',f_{w'}(x)) - m(x,y',f_{w'}(x),w) \ge 0 \right] \ge 1 - \|w\|_2^2/n \tag{7} \]


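Proof. First, note that $w' - \alpha w$ is a zero-mean and unit-variance Gaussian random vector. By well-known Gaussian
concentration inequalities, for any $p \in \mathcal{P}(x)$ we have:

\[ \mathbb{P}_{w'\sim Q(w)}\left[ |w'_p - \alpha w_p| \ge \varepsilon \right] \le 2e^{-\varepsilon^2/2} \]

By the union bound and setting $\varepsilon = \alpha = \sqrt{2\log(2n\ell/\|w\|_2^2)}$, we have:

\[ \mathbb{P}_{w'\sim Q(w)}\left[ (\exists p \in \cup_{(x,y)\in S}\mathcal{P}(x))\ |w'_p - \alpha w_p| \ge \alpha \right] \le 2\,|\cup_{(x,y)\in S}\mathcal{P}(x)|\, e^{-\alpha^2/2} \le \|w\|_2^2/n \]

or equivalently:

\[ \mathbb{P}_{w'\sim Q(w)}\left[ (\forall p \in \cup_{(x,y)\in S}\mathcal{P}(x))\ |w'_p - \alpha w_p| < \alpha \right] \ge 1 - \|w\|_2^2/n \]

The high-probability statement in eq.(7) can be written as:

\[ \hat{y} = f_{w'}(x) \ \Rightarrow\ H(x,y',\hat{y}) - m(x,y',\hat{y},w) \ge 0 \]

Next, we use proof by contradiction, i.e., we will assume:

\[ \hat{y} = f_{w'}(x) \ \text{ and } \ H(x,y',\hat{y}) - m(x,y',\hat{y},w) < 0 \]

and arrive at a contradiction $\hat{y} \ne f_{w'}(x)$. From the above, we have:

\begin{align*}
m(x,y',\hat{y},w') &= m(x,y',\hat{y},\alpha w + (w' - \alpha w)) \\
&= \alpha\, m(x,y',\hat{y},w) - (\phi(x,y') - \phi(x,\hat{y})) \cdot (\alpha w - w') \\
&> \alpha\, H(x,y',\hat{y}) - (\phi(x,y') - \phi(x,\hat{y})) \cdot (\alpha w - w') \\
&= \alpha\, H(x,y',\hat{y}) - \sum_{p\in\mathcal{P}(x)} (c(p,x,y') - c(p,x,\hat{y}))(\alpha w_p - w'_p) \\
&\ge \alpha\, H(x,y',\hat{y}) - \sum_{p\in\mathcal{P}(x)} |c(p,x,y') - c(p,x,\hat{y})|\,|\alpha w_p - w'_p| \\
&\ge \alpha\, H(x,y',\hat{y}) - \sum_{p\in\mathcal{P}(x)} |c(p,x,y') - c(p,x,\hat{y})|\,\alpha \\
&= 0
\end{align*}

where the last equality uses $H(x,y',\hat{y}) = \sum_{p\in\mathcal{P}(x)} |c(p,x,y') - c(p,x,\hat{y})|$.
Note that $m(x,y',\hat{y},w') > 0$ if and only if $\phi(x,y') \cdot w' > \phi(x,\hat{y}) \cdot w'$. Therefore $\hat{y} \ne f_{w'}(x)$ since it does not maximize
$\phi(x,\cdot) \cdot w'$ as defined in eq.(1). Thus, we prove our claim. □

Footnote to Lemma 1: We make two small corrections to Lemma 6 of [12]. First, it is only stated for $y' = f_w(x)$ but it does not make use of the optimality of
$f_w(x)$; thus, it holds for any $y' \in \mathcal{Y}(x)$. Second, for the union bound over all $p \in \cup_{(x,y)\in S}\mathcal{P}(x)$, we assume that $|\cup_{(x,y)\in S}\mathcal{P}(x)| \le \ell$. Instead,
Lemma 6 in [12] incorrectly assumes $|\mathcal{P}(x)| \le \ell$ for all $x \in \mathcal{X}$, and thus $|\cup_{(x,y)\in S}\mathcal{P}(x)| \le \sum_{(x,y)\in S}|\mathcal{P}(x)| \le n\ell$.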
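The choice of $\alpha$ in Lemma 1 can be checked numerically. The following is a minimal sketch, assuming the union bound reads $2\ell e^{-\alpha^2/2}$ with $\alpha = \sqrt{2\log(2n\ell/\|w\|_2^2)}$; the function names and the numeric values are illustrative and not part of the original manuscript.

```python
import math

def lemma1_alpha(n, ell, w_norm_sq):
    """alpha = sqrt(2 * log(2 * n * ell / ||w||_2^2)); requires 2*n*ell > ||w||_2^2."""
    return math.sqrt(2.0 * math.log(2.0 * n * ell / w_norm_sq))

def union_bound(n, ell, w_norm_sq):
    """2 * ell * exp(-alpha^2 / 2): Gaussian tail bound summed over ell coordinates."""
    alpha = lemma1_alpha(n, ell, w_norm_sq)
    return 2.0 * ell * math.exp(-alpha ** 2 / 2.0)

# Illustrative values only (assumed, not taken from the paper).
n, ell, w_norm_sq = 1000, 50, 4.0
print(union_bound(n, ell, w_norm_sq), w_norm_sq / n)
```

Both printed numbers should agree up to floating-point error, which is the reason this particular $\alpha$ yields a failure probability of $\|w\|_2^2/n$.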

Next, we provide the final proof. 

Proof of Theorem 1. Define the Gibbs decoder empirical distortion of the perturbation distribution $Q(w)$ and training
set $S$ as:

\[ L(Q(w),S) = \frac{1}{n} \sum_{(x,y)\in S} \mathbb{E}_{w'\sim Q(w)}\left[ d(y,f_{w'}(x)) \right] \]

In PAC-Bayes terminology, $Q(w)$ is the posterior distribution. Let the prior distribution $P$ be the unit-variance zero-mean
Gaussian distribution. Fix $\delta \in (0,1)$ and $\alpha > 0$. By well-known PAC-Bayes proof techniques, Lemma 4 in [12]
shows that with probability at least $1 - \delta/2$ over the choice of $n$ training samples, simultaneously for all parameters
$w \in \mathcal{W}$, and unit-variance Gaussian posterior distributions $Q(w)$ centered at $w\alpha$, we have:

\begin{align*}
L(Q(w),D) &\le L(Q(w),S) + \sqrt{\frac{KL(Q(w)\|P) + \log(2n/\delta)}{2(n-1)}} \\
&= L(Q(w),S) + \sqrt{\frac{\|w\|_2^2\,\alpha^2/2 + \log(2n/\delta)}{2(n-1)}} \tag{8}
\end{align*}


Thus, an upper bound of $L(Q(w),S)$ would lead to an upper bound of $L(Q(w),D)$. In order to upper-bound $L(Q(w),S)$,
we can upper-bound each of its summands, i.e., we can upper-bound $\mathbb{E}_{w'\sim Q(w)}[d(y,f_{w'}(x))]$ for each $(x,y) \in S$. Define
the distribution $Q(w,x)$ with support on $\mathcal{Y}(x)$ in the following form for all $\hat{y} \in \mathcal{Y}(x)$:

\[ \mathbb{P}_{y'\sim Q(w,x)}[y' = \hat{y}] \equiv \mathbb{P}_{w'\sim Q(w)}[f_{w'}(x) = \hat{y}] \tag{9} \]

For clarity of presentation, define:

\[ u(x,y,y',w) \equiv H(x,y,y') - m(x,y,y',w) \]

Let $u \equiv u(x,y,f_{w'}(x),w)$. Simultaneously for all $(x,y) \in S$, we have:

\begin{align*}
\mathbb{E}_{w'\sim Q(w)}[d(y,f_{w'}(x))] &= \mathbb{E}_{w'\sim Q(w)}\left[ d(y,f_{w'}(x))\,1(u \ge 0) + d(y,f_{w'}(x))\,1(u < 0) \right] \\
&\le \mathbb{E}_{w'\sim Q(w)}\left[ d(y,f_{w'}(x))\,1(u \ge 0) + 1(u < 0) \right] \tag{10.a} \\
&= \mathbb{E}_{w'\sim Q(w)}\left[ d(y,f_{w'}(x))\,1(u \ge 0) \right] + \mathbb{P}_{w'\sim Q(w)}[u < 0] \\
&\le \mathbb{E}_{w'\sim Q(w)}\left[ d(y,f_{w'}(x))\,1(u \ge 0) \right] + \|w\|_2^2/n \tag{10.b} \\
&= \mathbb{E}_{w'\sim Q(w)}\left[ d(y,f_{w'}(x))\,1(u(x,y,f_{w'}(x),w) \ge 0) \right] + \|w\|_2^2/n \\
&= \mathbb{E}_{y'\sim Q(w,x)}\left[ d(y,y')\,1(u(x,y,y',w) \ge 0) \right] + \|w\|_2^2/n \tag{10.c} \\
&\le \max_{\hat{y}\in\mathcal{Y}(x)} d(y,\hat{y})\,1(u(x,y,\hat{y},w) \ge 0) + \|w\|_2^2/n \tag{10.d}
\end{align*}






where the step in eq.(10.a) holds since $d : \mathcal{Y} \times \mathcal{Y} \to [0,1]$. The step in eq.(10.b) follows from Lemma 1 which states that
$\mathbb{P}_{w'\sim Q(w)}[u(x,y',f_{w'}(x),w) < 0] \le \|w\|_2^2/n$ for $\alpha = \sqrt{2\log(2n\ell/\|w\|_2^2)}$, simultaneously for all $(x,y) \in S$, $y' \in \mathcal{Y}(x)$ and
$w \in \mathcal{W}$. By the definition in eq.(9), the step in eq.(10.c) holds. Let $g : \mathcal{Y} \to [0,1]$ be some arbitrary function; the
step in eq.(10.d) uses the fact that $\mathbb{E}_y[g(y)] \le \max_y g(y)$.

By eq.(8) and eq.(10.d), we prove our claim. □
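To get a feeling for the size of the two additive terms appearing in this proof, the sketch below evaluates the PAC-Bayes slack of eq.(8) and the Lemma 1 term $\|w\|_2^2/n$. It assumes $KL(Q(w)\|P) = \alpha^2\|w\|_2^2/2$ with $\alpha^2 = 2\log(2n\ell/\|w\|_2^2)$, as used above; the function names and the numbers are hypothetical.

```python
import math

def pac_bayes_slack(n, ell, w_norm_sq, delta):
    """sqrt((KL + log(2n/delta)) / (2(n-1))) with KL = alpha^2 * ||w||_2^2 / 2."""
    alpha_sq = 2.0 * math.log(2.0 * n * ell / w_norm_sq)
    kl = w_norm_sq * alpha_sq / 2.0  # KL between N(alpha*w, I) and N(0, I)
    return math.sqrt((kl + math.log(2.0 * n / delta)) / (2.0 * (n - 1)))

def lemma1_term(n, w_norm_sq):
    """Additive term ||w||_2^2 / n contributed by Lemma 1."""
    return w_norm_sq / n

# Illustrative values only: both terms shrink as the sample size n grows.
for n in (100, 10_000, 1_000_000):
    print(n, pac_bayes_slack(n, 50, 4.0, 0.05), lemma1_term(n, 4.0))
```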


A.2 Proof of Theorem 2


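Here, we provide the proof of Theorem 2. First, we derive an intermediate lemma needed for the final proof.

Lemma 2. Let $\Delta \in \mathbb{R}^\ell$ be a random variable, and $w \in \mathbb{R}^\ell$ be a constant. If $\mathbb{E}[\mu(\Delta)] \cdot w \le 1/2$ then we have:

\[ \mathbb{P}\left[ \|\Delta\|_1 - \Delta \cdot w < 0 \right] \le \exp\left( \frac{-1}{32\|w\|_2^2} \right) \]

Proof. Let $t > 0$, we have that:

\begin{align*}
\mathbb{P}[\|\Delta\|_1 - \Delta \cdot w < 0] &= \mathbb{P}[\mu(\Delta) \cdot w > 1] \tag{11.a} \\
&= \mathbb{P}[(\mu(\Delta) - \mathbb{E}[\mu(\Delta)]) \cdot w > 1 - \mathbb{E}[\mu(\Delta)] \cdot w] \\
&\le \mathbb{P}[(\mu(\Delta) - \mathbb{E}[\mu(\Delta)]) \cdot w > 1/2] \tag{11.b} \\
&= \mathbb{P}[\exp(t(\mu(\Delta) - \mathbb{E}[\mu(\Delta)]) \cdot w) > e^{t/2}] \\
&\le e^{-t/2}\, \mathbb{E}[\exp(t(\mu(\Delta) - \mathbb{E}[\mu(\Delta)]) \cdot w)] \tag{11.c} \\
&\le \exp\left( -t/2 + 2t^2\|w\|_2^2 \right) \tag{11.d}
\end{align*}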

where the step in eq.(11.a) follows from dividing $\|\Delta\|_1 - \Delta \cdot w$ by $\|\Delta\|_1$. Note that $\Delta = 0$ does not fulfill either of
the two expressions $\|\Delta\|_1 - \Delta \cdot w < 0$, or $\mu(\Delta) \cdot w > 1$. The step in eq.(11.b) follows from $\mathbb{E}[\mu(\Delta)] \cdot w \le 1/2$ and
thus $1 - \mathbb{E}[\mu(\Delta)] \cdot w \ge 1/2$. The step in eq.(11.c) follows from Markov's inequality. The step in eq.(11.d) follows
from Hoeffding's lemma and the fact that the random variable $z = (\mu(\Delta) - \mathbb{E}[\mu(\Delta)]) \cdot w$ fulfills $\mathbb{E}[z] = 0$ as well as
$z \in [-2\|w\|_2, +2\|w\|_2]$. In more detail, note that $\|\mu(\Delta)\|_2 \le 1$ since it holds trivially for $\Delta = 0$, and for $\Delta \ne 0$ we
have that $\|\mu(\Delta)\|_2 = \|\Delta\|_2/\|\Delta\|_1 \le 1$. By Jensen's inequality $\|\mathbb{E}[\mu(\Delta)]\|_2 \le \mathbb{E}[\|\mu(\Delta)\|_2] \le 1$. Then, note that by
Cauchy-Schwarz inequality $|(\mu(\Delta) - \mathbb{E}[\mu(\Delta)]) \cdot w| \le \|\mu(\Delta) - \mathbb{E}[\mu(\Delta)]\|_2 \|w\|_2 \le (\|\mu(\Delta)\|_2 + \|\mathbb{E}[\mu(\Delta)]\|_2)\|w\|_2 \le 2\|w\|_2$.
Finally, let $g(t) = -t/2 + 2t^2\|w\|_2^2$. By making $\partial g/\partial t = 0$, we get the optimal setting $t^* = 1/(8\|w\|_2^2)$. Thus,
$g(t^*) = -1/(32\|w\|_2^2)$ and we prove our claim. □
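The last optimization step of Lemma 2 is easy to verify numerically. The sketch below assumes the exponent reads $g(t) = -t/2 + 2t^2\|w\|_2^2$; the function names are illustrative.

```python
def chernoff_exponent(t, w_norm_sq):
    """g(t) = -t/2 + 2 * t^2 * ||w||_2^2, the exponent obtained in eq.(11.d)."""
    return -t / 2.0 + 2.0 * t * t * w_norm_sq

def optimal_t(w_norm_sq):
    """Stationary point of g: t* = 1 / (8 * ||w||_2^2)."""
    return 1.0 / (8.0 * w_norm_sq)

w_norm_sq = 2.5                      # illustrative value of ||w||_2^2
t_star = optimal_t(w_norm_sq)
print(chernoff_exponent(t_star, w_norm_sq), -1.0 / (32.0 * w_norm_sq))
```

Since $g$ is a convex quadratic in $t$, the stationary point is the global minimizer, so no other $t > 0$ gives a smaller exponent.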


Next, we provide the final proof. 


Proof of Theorem 2. Note that sampling from the distribution $Q(w,x)$ as defined in eq.(9) is NP-hard in general, thus
our plan is to upper-bound the expectation in eq.(10.c) by using the maximum over random structured outputs sampled
independently from a proposal distribution $R(w,x)$ with support on $\mathcal{Y}(x)$.

Let $T(w,x)$ be a set of $n'$ i.i.d. random structured outputs drawn from the proposal distribution $R(w,x)$,
i.e., $T(w,x) \sim R(w,x)^{n'}$. Furthermore, let $\mathbb{T}(w)$ be the collection of the $n$ sets $T(w,x)$ for all $(x,y) \in S$, i.e.
$\mathbb{T}(w) \equiv \{T(w,x)\}_{(x,y)\in S}$ and thus $\mathbb{T}(w) \sim \{R(w,x)^{n'}\}_{(x,y)\in S}$. For clarity of presentation, define:

\[ v(x,y,y',w) \equiv d(y,y')\,1(H(x,y,y') - m(x,y,y',w) \ge 0) \]
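As an illustration of the quantity $v$ and of the maximum over a random set $T(w,x)$, here is a minimal sketch on a toy problem. It assumes $H(x,y,y') = \|\Delta\|_1$ and $m(x,y,y',w) = \Delta \cdot w$ with $\Delta = \phi(x,y) - \phi(x,y')$, and uses a uniform proposal; the one-hot features, the 0/1 distortion and all function names are hypothetical choices made only for this example.

```python
import random

def v_term(phi_y, phi_yp, w, distortion):
    """v = d(y, y') * 1(H - m >= 0) with H = ||Delta||_1 and m = Delta . w."""
    delta = [a - b for a, b in zip(phi_y, phi_yp)]
    h = sum(abs(t) for t in delta)
    m = sum(t * wi for t, wi in zip(delta, w))
    return distortion if h - m >= 0 else 0.0

def sampled_max_loss(phi, d, y, outputs, w, n_prime, rng):
    """Maximum of v over n_prime outputs drawn i.i.d. from a uniform proposal."""
    sample = [rng.choice(outputs) for _ in range(n_prime)]
    return max(v_term(phi(y), phi(yp), w, d(y, yp)) for yp in sample)

def full_max_loss(phi, d, y, outputs, w):
    """Maximum of v over all feasible outputs (exhaustive, for comparison)."""
    return max(v_term(phi(y), phi(yp), w, d(y, yp)) for yp in outputs)

# Toy setup (hypothetical): 5 labels, one-hot features, 0/1 distortion.
OUTPUTS = list(range(5))

def one_hot(y):
    return [1.0 if k == y else 0.0 for k in OUTPUTS]

def zero_one(y, yp):
    return 0.0 if y == yp else 1.0

rng = random.Random(0)
w = [1.0, 0.0, 0.0, 0.0, 0.0]
print(sampled_max_loss(one_hot, zero_one, 0, OUTPUTS, w, 3, rng))
print(full_max_loss(one_hot, zero_one, 0, OUTPUTS, w))
```

The sampled maximum costs time linear in `n_prime` and can never exceed the exhaustive maximum, since the sample is a subset of the feasible outputs.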


For sets $T(w,x)$ of sufficient size $n'$, our goal is to upper-bound eq.(10.c) in the following form for all parameters $w \in \mathcal{W}$:

\[ \frac{1}{n} \sum_{(x,y)\in S} \mathbb{E}_{y'\sim Q(w,x)}[v(x,y,y',w)] \le \frac{1}{n} \sum_{(x,y)\in S} \max_{\hat{y}\in T(w,x)} v(x,y,\hat{y},w) + O(\log^{3/2} n/\sqrt{n}) \]


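Note that the above expression would produce a tighter upper bound than the maximum loss over all possible structured
outputs since $\max_{\hat{y}\in T(w,x)} v(x,y,\hat{y},w) \le \max_{\hat{y}\in\mathcal{Y}(x)} v(x,y,\hat{y},w)$. For analysis purposes, we decompose the latter equation
into two quantities:

\[ A(w,S) \equiv \frac{1}{n} \sum_{(x,y)\in S} \left( \mathbb{E}_{y'\sim Q(w,x)}[v(x,y,y',w)] - \mathbb{E}_{T(w,x)\sim R(w,x)^{n'}}\left[ \max_{\hat{y}\in T(w,x)} v(x,y,\hat{y},w) \right] \right) \tag{12} \]

\[ B(w,S,\mathbb{T}(w)) \equiv \frac{1}{n} \sum_{(x,y)\in S} \left( \mathbb{E}_{T(w,x)\sim R(w,x)^{n'}}\left[ \max_{\hat{y}\in T(w,x)} v(x,y,\hat{y},w) \right] - \max_{\hat{y}\in T(w,x)} v(x,y,\hat{y},w) \right) \tag{13} \]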
Thus, we will show that $A(w,S) \le \sqrt{1/n}$ and $B(w,S,\mathbb{T}(w)) \le O(\log^{3/2} n/\sqrt{n})$ for all parameters $w \in \mathcal{W}$, any training set
$S$ and all collections $\mathbb{T}(w)$, and therefore $A(w,S) + B(w,S,\mathbb{T}(w)) \le O(\log^{3/2} n/\sqrt{n})$. Note that while the value of $A(w,S)$
is deterministic, the value of $B(w,S,\mathbb{T}(w))$ is stochastic given that $\mathbb{T}(w)$ is a collection of sampled random structured
outputs.

Fix a specific $w \in \mathcal{W}$. If data is separable then $v(x,y,y',w) = 0$ for all $(x,y) \in S$ and $y' \in \mathcal{Y}(x)$. Thus, we have
$A(w,S) = B(w,S,\mathbb{T}(w)) = 0$ and we complete our proof for the separable case. In what follows, we focus on the
nonseparable case.


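Bounding the Deterministic Expectation $A(w,S)$. Here, we show that in eq.(12), $A(w,S) \le \sqrt{1/n}$ for all parameters
$w \in \mathcal{W}$ and any training set $S$, provided that we use a sufficient number $n'$ of random structured outputs sampled
from the proposal distribution.

By well-known identities, we can rewrite:

\begin{align*}
A(w,S) &= \frac{1}{n} \sum_{(x,y)\in S} \int_0^1 \left( \mathbb{P}_{y'\sim R(w,x)}[v(x,y,y',w) < z]^{n'} - \mathbb{P}_{y'\sim Q(w,x)}[v(x,y,y',w) < z] \right) dz \tag{14.a} \\
&\le \frac{1}{n} \sum_{(x,y)\in S} \mathbb{P}_{y'\sim R(w,x)}[v(x,y,y',w) < 1]^{n'} \\
&= \frac{1}{n} \sum_{(x,y)\in S} \mathbb{P}_{y'\sim R(w,x)}[d(y,y') < 1 \vee H(x,y,y') - m(x,y,y',w) < 0]^{n'} \\
&= \frac{1}{n} \sum_{(x,y)\in S} \left( 1 - \mathbb{P}_{y'\sim R(w,x)}[d(y,y') = 1 \wedge H(x,y,y') - m(x,y,y',w) \ge 0] \right)^{n'} \\
&\le \frac{1}{n} \sum_{(x,y)\in S} \left( 1 - \min\left( \mathbb{P}_{y'\sim R(w,x)}[d(y,y') = 1],\ \mathbb{P}_{y'\sim R(w,x)}[H(x,y,y') - m(x,y,y',w) \ge 0] \right) \right)^{n'} \\
&= \frac{1}{n} \sum_{(x,y)\in S} \max\left( 1 - \mathbb{P}_{y'\sim R(w,x)}[d(y,y') = 1],\ \mathbb{P}_{y'\sim R(w,x)}[H(x,y,y') - m(x,y,y',w) < 0] \right)^{n'} \\
&\le \max\left( \beta,\ \exp\left( \frac{-1}{32\|w\|_2^2} \right) \right)^{n'} \tag{14.b} \\
&= \sqrt{1/n} \tag{14.c}
\end{align*}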

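where the step in eq.(14.a) holds since for two independent random variables $g, h \in [0,1]$, we have
$\mathbb{E}[g] = 1 - \int_0^1 \mathbb{P}[g < z]\,dz$ and $\mathbb{P}[\max(g,h) < z] = \mathbb{P}[g < z]\,\mathbb{P}[h < z]$. Therefore, $\mathbb{E}[\max(g,h)] = 1 - \int_0^1 \mathbb{P}[g < z]\,\mathbb{P}[h < z]\,dz$.
For the step in eq.(14.b), we used Assumption A for the first term in the max.
For the second term in the max, we used Assumption B. More formally, let $\Delta = \phi(x,y) - \phi(x,y')$ then
$H(x,y,y') = \|\Delta\|_1$ and $m(x,y,y',w) = \Delta \cdot w$. By Assumption B, we have that $\|\mathbb{E}[\mu(\Delta)]\|_2 \le 1/(2\sqrt{n}) \le 1/(2\|w\|_2)$.
By Cauchy-Schwarz inequality we have $\mathbb{E}[\mu(\Delta)] \cdot w \le \|\mathbb{E}[\mu(\Delta)]\|_2 \|w\|_2 \le \|w\|_2/(2\|w\|_2) = 1/2$. Since
$\mathbb{E}[\mu(\Delta)] \cdot w \le 1/2$, we apply Lemma 2 in the step in eq.(14.b). For the step in eq.(14.c), let
$\alpha \equiv \max\left( \frac{1}{\log(1/\beta)},\ 32\|w\|_2^2 \right)$. Note that $\max\left( \beta, \exp\left( \frac{-1}{32\|w\|_2^2} \right) \right) = e^{-1/\alpha}$. Furthermore, let $n' = \frac{1}{2}\alpha\log n$. Therefore,
$\max\left( \beta, \exp\left( \frac{-1}{32\|w\|_2^2} \right) \right)^{n'} = e^{-n'/\alpha} = e^{-\frac{1}{2}\log n} = \sqrt{1/n}$.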
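The number of samples required by the step in eq.(14.c) can be computed directly. The sketch below assumes the choice $n' = \frac{1}{2}\max(1/\log(1/\beta), 32\|w\|_2^2)\log n$ and checks that the resulting residual equals $\sqrt{1/n}$; the function names and numeric values are illustrative, and in practice $n'$ would be rounded up to an integer.

```python
import math

def num_random_outputs(n, beta, w_norm_sq):
    """n' = 0.5 * max(1 / log(1 / beta), 32 * ||w||_2^2) * log(n) (real-valued)."""
    scale = max(1.0 / math.log(1.0 / beta), 32.0 * w_norm_sq)
    return 0.5 * scale * math.log(n)

def residual(n, beta, w_norm_sq):
    """max(beta, exp(-1 / (32 * ||w||_2^2))) raised to the power n'."""
    base = max(beta, math.exp(-1.0 / (32.0 * w_norm_sq)))
    return base ** num_random_outputs(n, beta, w_norm_sq)

# Illustrative values only: beta from Assumption A and ||w||_2^2 are assumed.
n, beta, w_norm_sq = 10_000, 0.5, 1.0
print(math.ceil(num_random_outputs(n, beta, w_norm_sq)))
print(residual(n, beta, w_norm_sq), 1.0 / math.sqrt(n))
```

Note that $n'$ grows only logarithmically with the number of training samples $n$.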


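Bounding the Stochastic Quantity $B(w,S,\mathbb{T}(w))$. Here, we show that in eq.(13), $B(w,S,\mathbb{T}(w)) \le O(\log^{3/2} n/\sqrt{n})$
for all parameters $w \in \mathcal{W}$, any training set $S$ and all collections $\mathbb{T}(w)$. For clarity of presentation, define:

\[ g(x,y,T,w) \equiv \max_{\hat{y}\in T} v(x,y,\hat{y},w) \]

Thus, we can rewrite:

\[ B(w,S,\mathbb{T}(w)) = \frac{1}{n} \sum_{(x,y)\in S} \left( \mathbb{E}_{T(w,x)\sim R(w,x)^{n'}}[g(x,y,T(w,x),w)] - g(x,y,T(w,x),w) \right) \]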

Footnote (on the separable case): The same result can be obtained for any subset of $S$ for which the "separability" condition holds. Therefore, our analysis with the
"nonseparability" condition can be seen as a worst case scenario.
Let $r(x) \equiv |\mathcal{Y}(x)|$ and thus $\mathcal{Y}(x) \equiv \{y_1 \ldots y_{r(x)}\}$. Let $\pi(x) = (\pi_1 \ldots \pi_{r(x)})$ be a permutation of $\{1 \ldots r(x)\}$ such
that $\phi(x,y_{\pi_1}) \cdot w < \cdots < \phi(x,y_{\pi_{r(x)}}) \cdot w$. Let $\Pi$ be the collection of the $n$ permutations $\pi(x)$ for all $(x,y) \in S$, i.e.
$\Pi = \{\pi(x)\}_{(x,y)\in S}$. From Assumption C, we have that $R(\pi(x),x) \equiv R(w,x)$. Similarly, we rewrite $T(\pi(x),x) \equiv T(w,x)$
and $\mathbb{T}(\Pi) \equiv \mathbb{T}(w)$.

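Furthermore, let $\mathcal{W}_{\Pi,S}$ be the set of all $w \in \mathcal{W}$ that induce $\Pi$ on the training set $S$. For the parameter space $\mathcal{W}$,
collection $\Pi$ and training set $S$, define the function class $\mathfrak{G}_{\mathcal{W},\Pi,S}$ as follows:

\[ \mathfrak{G}_{\mathcal{W},\Pi,S} \equiv \{ g(x,y,T,w) \mid w \in \mathcal{W}_{\Pi,S} \wedge (x,y) \in S \} \]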

Note that since $|\mathcal{Y}(x)| \le r$ for all $(x,y) \in S$, then $|\cup_{(x,y)\in S}\mathcal{Y}(x)| \le \sum_{(x,y)\in S}|\mathcal{Y}(x)| \le nr$. Note that each ordering of
the $nr$ structured outputs completely determines a collection $\Pi$ and thus the collection of proposal distributions $R(w,x)$
for each $(x,y) \in S$. Note that since $|\cup_{(x,y)\in S}\mathcal{P}(x)| \le \ell$, we need to consider $\phi(x,y) \in \mathbb{R}^\ell$. Although we can consider
$w \in \mathbb{R}^\ell$, the vector $w$ is sparse with at most $s$ non-zero entries. Thus, we take into account all possible subsets of $s$
features from $\ell$ possible features. From results in [UI31IS], we can conclude that there are at most $(nr)^{2(s-1)}$ linearly
inducible orderings, for a fixed set of $s$ features. Therefore, there are at most $\binom{\ell}{s}(nr)^{2(s-1)} \le \ell^s (nr)^{2s}$ collections $\Pi$.
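The counting step can be sanity-checked with exact integer arithmetic. This is a small sketch, assuming the count reads $\binom{\ell}{s}(nr)^{2(s-1)} \le \ell^s(nr)^{2s}$; the function names and the sample values are illustrative.

```python
import math

def collection_counts(ell, s, n, r):
    """Return (binom(ell, s) * (n*r)^(2(s-1)), ell^s * (n*r)^(2s)) as exact integers."""
    tight = math.comb(ell, s) * (n * r) ** (2 * (s - 1))
    loose = ell ** s * (n * r) ** (2 * s)
    return tight, loose

def log_union_term(ell, s, n, r):
    """log(ell^s * (n*r)^(2s)) = s * (log(ell) + 2 * log(n*r))."""
    return s * (math.log(ell) + 2.0 * math.log(n * r))

tight, loose = collection_counts(ell=100, s=4, n=50, r=10)
print(tight <= loose, math.log(loose), log_union_term(100, 4, 50, 10))
```

The logarithm of the looser count is the term that enters the union bound over collections.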

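Fix $\delta \in (0,1)$. By Rademacher-based uniform convergence and by a union bound over all $\ell^s(nr)^{2s}$ collections $\Pi$, with
probability at least $1 - \delta/2$ over the choice of $n$ sets of random structured outputs, simultaneously for all parameters
$w \in \mathcal{W}$:

\[ B(w,S,\mathbb{T}(w)) \le 2\,\mathfrak{R}_{\mathbb{T}(\Pi)}(\mathfrak{G}_{\mathcal{W},\Pi,S}) + 3\sqrt{\frac{s(\log\ell + 2\log(nr)) + \log(4/\delta)}{n}} \tag{15} \]

where $\mathfrak{R}_{\mathbb{T}(\Pi)}(\mathfrak{G}_{\mathcal{W},\Pi,S})$ is the empirical Rademacher complexity of the function class $\mathfrak{G}_{\mathcal{W},\Pi,S}$ with respect to the collection
$\mathbb{T}(\Pi)$ of the $n$ sets $T(\pi(x),x)$ for all $(x,y) \in S$. For clarity, define:

\[ \Delta_p(x,y,y') \equiv \begin{cases} c(p,x,y) - c(p,x,y') & \text{if } p \in \mathcal{P}(x) \\ 0 & \text{otherwise} \end{cases} \]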


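Let $\sigma$ be an $n$-dimensional vector of independent Rademacher random variables indexed by $(x,y) \in S$, i.e.,
$\mathbb{P}[\sigma_{(x,y)} = +1] = \mathbb{P}[\sigma_{(x,y)} = -1] = 1/2$. The empirical Rademacher complexity is defined as:

\begin{align*}
\mathfrak{R}_{\mathbb{T}(\Pi)}(\mathfrak{G}_{\mathcal{W},\Pi,S}) &\equiv \mathbb{E}_\sigma\left[ \sup_{g\in\mathfrak{G}_{\mathcal{W},\Pi,S}} \left( \frac{1}{n} \sum_{(x,y)\in S} \sigma_{(x,y)}\, g(x,y,T(\pi(x),x),w) \right) \right] \\
&= \mathbb{E}_\sigma\left[ \sup_{w\in\mathcal{W}_{\Pi,S}} \left( \frac{1}{n} \sum_{(x,y)\in S} \sigma_{(x,y)} \max_{\hat{y}\in T(\pi(x),x)} d(y,\hat{y})\,1(H(x,y,\hat{y}) - m(x,y,\hat{y},w) \ge 0) \right) \right] \\
&= \mathbb{E}_\sigma\left[ \sup_{w\in\mathcal{W}_{\Pi,S}} \left( \frac{1}{n} \sum_{(x,y)\in S} \sigma_{(x,y)} \max_{\hat{y}\in T(\pi(x),x)} d(y,\hat{y})\,1(\|\Delta(x,y,\hat{y})\|_1 - \Delta(x,y,\hat{y}) \cdot w \ge 0) \right) \right] \\
&= \mathbb{E}_\sigma\left[ \sup_{w\in\mathbb{R}^\ell\setminus\{0\}} \left( \frac{1}{n} \sum_{i=1}^{n} \sigma_i \max_{j\in\{1\ldots n'\}} d_{ij}\,1(\|z_{ij}\|_1 - z_{ij} \cdot w \ge 0) \right) \right] \tag{16.a} \\
&\le \sum_{j\in\{1\ldots n'\}} \mathbb{E}_\sigma\left[ \sup_{w\in\mathbb{R}^\ell\setminus\{0\}} \left( \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, d_{ij}\,1(\|z_{ij}\|_1 - z_{ij} \cdot w \ge 0) \right) \right] \tag{16.b} \\
&\le \sum_{j\in\{1\ldots n'\}} \mathbb{E}_\sigma\left[ \sup_{w\in\mathbb{R}^\ell\setminus\{0\}} \left( \frac{1}{n} \sum_{i=1}^{n} \sigma_i\,1(\|z_{ij}\|_1 - z_{ij} \cdot w \ge 0) \right) \right] \tag{16.c} \\
&\le \sum_{j\in\{1\ldots n'\}} \mathbb{E}_\sigma\left[ \sup_{w\in\mathbb{R}^{\ell+1}\setminus\{0\}} \left( \frac{1}{n} \sum_{i=1}^{n} \sigma_i\,1(z_{ij} \cdot w \ge 0) \right) \right] \tag{16.d} \\
&\le 2n' \sqrt{\frac{s\log(\ell+1)\log(n+1)}{n}} \tag{16.e}
\end{align*}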


Footnote (on Rademacher-based uniform convergence): Note that for the analysis of $B(w,S,\mathbb{T}(w))$, the training set $S$ is fixed and randomness stems from the collection $\mathbb{T}(w)$. Also, note that
for applying McDiarmid's inequality, independence of each set $T(w,x)$ for all $(x,y) \in S$ is a sufficient condition, and identically distributed
sets $T(w,x)$ are not necessary.
where in the step in eq.(16.a), the terms $\sigma_i$, $d_{ij}$ and $z_{ij}$ correspond to $\sigma_{(x,y)}$, $d(y,\hat{y})$ and $\Delta(x,y,\hat{y})$ respectively. Thus,
we assume that index $i$ corresponds to the training sample $(x,y) \in S$, and that index $j$ corresponds to the structured
output $\hat{y} \in T(\pi(x),x)$. Note that since $|\mathcal{P}(x)| \le \ell$, the step in eq.(16.a) considers $w, z_{ij} \in \mathbb{R}^\ell \setminus \{0\}$ without
loss of generality. The step in eq.(16.b) follows from the fact that for any two function classes $\mathfrak{G}$ and $\mathfrak{H}$, we have that
$\mathfrak{R}(\{\max(g,h) \mid g \in \mathfrak{G} \wedge h \in \mathfrak{H}\}) \le \mathfrak{R}(\mathfrak{G}) + \mathfrak{R}(\mathfrak{H})$. The step in eq.(16.c) follows from the composition lemma and the
fact that $d_{ij} \in [0,1]$ for all $i$ and $j$. The step in eq.(16.d) considers a larger function class: since the value of $\|z_{ij}\|_1$ can
be taken as an additional entry in the vector $z_{ij}$, we consider $w, z_{ij} \in \mathbb{R}^{\ell+1} \setminus \{0\}$. The step in eq.(16.e) follows from the
Massart lemma, the Sauer-Shelah lemma and the VC-dimension of sparse linear classifiers. That is, for any function
class $\mathfrak{G}$, we have that $\mathfrak{R}(\mathfrak{G}) \le \sqrt{2\,VC(\mathfrak{G})\log(n+1)/n}$, where $VC(\mathfrak{G})$ is the VC-dimension of $\mathfrak{G}$. Furthermore, by Theorem 20
of [15], $VC(\mathfrak{G}) \le 2s\log(\ell+1)$ for the class $\mathfrak{G}$ of sparse linear classifiers on $\mathbb{R}^{\ell+1}$ with $3 \le s \le \frac{9}{20}\sqrt{\ell+1}$.

By eq.(8), eq.(10.c), eq.(14.c), eq.(15) and eq.(16.e), we prove our claim. □


A.3 Proof of Claim i

Proof. For all $(x,y) \in S$ and $w \in \mathcal{W}$, by definition of the total variation distance, we have for any event $A(x,y,y',w)$:

\[ \left| \mathbb{P}_{y'\sim R(w,x)}[A(x,y,y',w)] - \mathbb{P}_{y'\sim R'(w,x)}[A(x,y,y',w)] \right| \le TV(R(w,x)\|R'(w,x)) \]

Let the event $A(x,y,y',w) : d(y,y') = 1 \wedge H(x,y,y') - m(x,y,y',w) \ge 0$. Since $R(w,x)$ fulfills Assumption A with value
$\beta_1$ and since $TV(R(w,x)\|R'(w,x)) \le \beta_2$, we have that for all $(x,y) \in S$ and $w \in \mathcal{W}$:

\begin{align*}
\mathbb{P}_{y'\sim R'(w,x)}[A(x,y,y',w)] &\ge \mathbb{P}_{y'\sim R(w,x)}[A(x,y,y',w)] - TV(R(w,x)\|R'(w,x)) \\
&\ge 1 - \beta_1 - \beta_2
\end{align*}

which proves our claim. □
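The total variation argument can be illustrated on small discrete distributions. In the sketch below, the two proposal distributions and the outcome names are made-up toy values; the code only checks that the probability of any event changes by at most the total variation distance.

```python
from itertools import combinations

def total_variation(p, q):
    """TV distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def event_probability(p, event):
    """Probability mass that distribution p assigns to a set of outcomes."""
    return sum(mass for outcome, mass in p.items() if outcome in event)

# Hypothetical proposal distributions R and R' over three structured outputs.
R = {"a": 0.5, "b": 0.3, "c": 0.2}
R_prime = {"a": 0.4, "b": 0.4, "c": 0.2}

tv = total_variation(R, R_prime)
gaps = [
    abs(event_probability(R, set(ev)) - event_probability(R_prime, set(ev)))
    for k in range(len(R) + 1)
    for ev in combinations(R, k)
]
print(tv, max(gaps))
```

Under these assumptions, if the event has probability at least $1 - \beta_1$ under `R`, it has probability at least $1 - \beta_1 - \mathrm{TV}$ under `R_prime`.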


A.4 Proof of Claim ii

Proof. Since $d(y,y') = 1(y \ne y')$ and since $R(x)$ is a uniform proposal distribution with support on $\mathcal{Y}(x)$, we have:

\[ \mathbb{P}_{y'\sim R(x)}[d(y,y') = 1] = 1 - 1/|\mathcal{Y}(x)| \ge 1 - 1/2 \tag{17.a} \]

where the step in eq.(17.a) follows since $|\mathcal{Y}(x)| \ge 2$. □


A.5 Proof of Claim iii

Proof. Let $s = (s_1, s_2, s_3 \ldots s_v)$ be the pre-order traversal of $y$. Let $s' = (s_2, s_1, s_3 \ldots s_v)$ be a node ordering where we
switched $s_1$ with $s_2$. Let $\mathcal{Y}'(x)$ be the set of directed spanning trees of $v$ nodes with node ordering $s'$. Let $R'(x)$ be the
uniform proposal distribution with support on $\mathcal{Y}'(x)$. Since $\mathcal{Y}'(x)$ is the set of directed spanning trees of $v$ nodes with
a specific node ordering, then $|\mathcal{Y}'(x)| = \prod_{i=2}^{v}(i-1) = (v-1)!$. Moreover, since $d(y,y') = \frac{1}{2(v-1)}\sum_{ij}|A(y)_{ij} - A(y')_{ij}|$
and since $R'(x)$ is a uniform proposal distribution with support on $\mathcal{Y}'(x)$, we have:

\begin{align*}
\mathbb{P}_{y'\sim R(x)}[d(y,y') = 1] &\ge \mathbb{P}_{y'\sim R'(x)}[d(y,y') = 1] \\
&= \mathbb{P}_{y'\sim R'(x)}\left[ \sum_{ij} |A(y)_{ij} - A(y')_{ij}| = 2(v-1) \right] \\
&= \frac{1}{(v-1)!} \sum_{\hat{y}\in\mathcal{Y}'(x)} 1\left( \sum_{ij} |A(y)_{ij} - A(\hat{y})_{ij}| = 2(v-1) \right) \\
&= \frac{1}{(v-1)!} \prod_{i=3}^{v} (i-2) \tag{18.a} \\
&= 1 - \frac{v-2}{v-1}
\end{align*}

where the step in eq.(18.a) follows from the fact that when choosing the parent for the node in position $i$ in the ordering
$s'$, we have one option less (i.e., the option that is in $y$). □

Footnote (on the node ordering $s'$): We use the node ordering $s'$ in order to have trees in $\mathcal{Y}'(x)$ with all edges different from $y$. If we use the node ordering $s$ instead, every
tree in $\mathcal{Y}'(x)$ will contain the edge $(s_2, s_1)$, thus no tree in $\mathcal{Y}'(x)$ will have all edges different from $y$.
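The counting in this proof can be confirmed by brute force for small trees. The sketch below relabels the nodes so that the pre-order traversal of $y$ is $(0, 1, \ldots, v-1)$, encodes a tree as a child-to-parent map, and enumerates every tree compatible with the swapped ordering $s'$ (each node picks its parent among the nodes that precede it in $s'$). The encoding and the example tree are assumptions made for illustration.

```python
from itertools import product

def count_fully_different(y_parents):
    """Return (#trees under ordering s' sharing no directed edge with y, #trees)."""
    v = len(y_parents) + 1
    y_edges = {(parent, child) for child, parent in y_parents.items()}
    order = [1, 0] + list(range(2, v))          # s' = s with its first two nodes swapped
    choices = [order[:i] for i in range(1, v)]  # allowed parents of order[i]
    hits = total = 0
    for combo in product(*choices):
        edges = {(parent, order[i + 1]) for i, parent in enumerate(combo)}
        total += 1
        if not (edges & y_edges):
            hits += 1
    return hits, total

# Example tree y on v = 5 nodes with pre-order traversal (0, 1, 2, 3, 4):
# node 0 has children 1 and 4, node 1 has children 2 and 3.
y_parents = {1: 0, 2: 1, 3: 1, 4: 0}
hits, total = count_fully_different(y_parents)
print(hits, total, hits / total)
```

For this example the ratio should match $1 - (v-2)/(v-1) = 1/(v-1)$ from the last line of the derivation.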


A.6 Proof of Claim iv

Proof. Let $s = (s_1, s_2, s_3 \ldots s_v)$ be the pre-order traversal of $y$. Let $s' = (s_2, s_1, s_3 \ldots s_v)$ be a node ordering where
we switched $s_1$ with $s_2$. Let $\mathcal{Y}'(x)$ be the set of directed acyclic graphs of $v$ nodes and $b$ parents per node, and
with node ordering $s'$. Let $R'(x)$ be the uniform proposal distribution with support on $\mathcal{Y}'(x)$. Since $\mathcal{Y}'(x)$ is
the set of directed acyclic graphs of $v$ nodes and $b$ parents per node, and with a specific node ordering, then
$|\mathcal{Y}'(x)| = \prod_{i=2}^{b+1}(i-1) \prod_{i=b+2}^{v}\binom{i-1}{b} = b! \prod_{i=b+2}^{v}\binom{i-1}{b}$. Moreover, since $d(y,y') = \frac{1}{b(2v-b-1)}\sum_{ij}|A(y)_{ij} - A(y')_{ij}|$ and
since $R'(x)$ is a uniform proposal distribution with support on $\mathcal{Y}'(x)$, we have:

\begin{align*}
\mathbb{P}_{y'\sim R(x)}[d(y,y') = 1] &\ge \mathbb{P}_{y'\sim R'(x)}[d(y,y') = 1] \\
&= \mathbb{P}_{y'\sim R'(x)}\left[ \sum_{ij} |A(y)_{ij} - A(y')_{ij}| = b(2v-b-1) \right] \\
&= \left( b! \prod_{i=b+2}^{v} \binom{i-1}{b} \right)^{-1} \sum_{\hat{y}\in\mathcal{Y}'(x)} 1\left( \sum_{ij} |A(y)_{ij} - A(\hat{y})_{ij}| = b(2v-b-1) \right) \\
&= \left( b! \prod_{i=b+2}^{v} \binom{i-1}{b} \right)^{-1} \prod_{i=3}^{b+1} (i-2) \prod_{i=b+2}^{v} \left( \binom{i-1}{b} - 1 \right) \tag{19.a} \\
&= \frac{1}{b} \cdot \frac{\binom{b+1}{b} - 1}{\binom{b+1}{b}} \prod_{i=b+3}^{v} \frac{\binom{i-1}{b} - 1}{\binom{i-1}{b}} \\
&\ge \frac{1}{b} \cdot \frac{\binom{b+1}{b} - 1}{\binom{b+1}{b}} \prod_{i=b+3}^{v} \frac{\binom{i-1}{2} - 1}{\binom{i-1}{2}} \tag{19.b} \\
&= \frac{bv}{(b^2+3b+2)(v-2)} \\
&\ge 1 - \frac{b^2+2b+2}{b^2+3b+2} \tag{19.c}
\end{align*}

where the step in eq.(19.a) follows from the fact that when choosing the $b$ parents for the node in position $i$ in the
ordering $s'$, we have one option less (i.e., the option that is in $y$). The step in eq.(19.b) follows from the fact that the
function $z \mapsto (z-1)/z$ is nondecreasing as well as $\binom{a}{2} \le \binom{a}{b}$ for $a \ge b+2$ and $b \ge 2$. The step in eq.(19.c) follows from the fact
$v/(v-2) \ge 1$ for $v > 2$. □

Footnote (on the node ordering $s'$): We use the node ordering $s'$ in order to have graphs in $\mathcal{Y}'(x)$ with all edges different from $y$. If we use the node ordering $s$ instead, every
graph in $\mathcal{Y}'(x)$ will contain the edge $(s_2, s_1)$, thus no graph in $\mathcal{Y}'(x)$ will have all edges different from $y$.
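The chain of inequalities for directed acyclic graphs can be checked numerically. The sketch below assumes that the probability after eq.(19.a) simplifies to $\frac{1}{b}\prod_{i=b+2}^{v}\left(1 - 1/\binom{i-1}{b}\right)$, and compares it against the closed form $\frac{bv}{(b^2+3b+2)(v-2)}$ and the final bound; the function names are illustrative.

```python
from math import comb

def dag_probability(v, b):
    """(1/b) * prod_{i=b+2}^{v} (1 - 1/C(i-1, b)), the ratio after eq.(19.a)."""
    p = 1.0 / b
    for i in range(b + 2, v + 1):
        p *= 1.0 - 1.0 / comb(i - 1, b)
    return p

def closed_form(v, b):
    """b*v / ((b^2 + 3b + 2) * (v - 2)), obtained after replacing C(i-1, b) by C(i-1, 2)."""
    return b * v / ((b * b + 3 * b + 2) * (v - 2))

def final_bound(b):
    """1 - (b^2 + 2b + 2) / (b^2 + 3b + 2), which equals b / (b^2 + 3b + 2)."""
    return 1.0 - (b * b + 2 * b + 2) / (b * b + 3 * b + 2)

for b in (2, 3, 4):
    for v in (b + 2, b + 5, b + 10):
        print(b, v, dag_probability(v, b), closed_form(v, b), final_bound(b))
```

For $b = 2$ the first two quantities coincide, since no binomial coefficient is replaced in that case.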
A.7 Proof of Claim v


Proof. Since $\mathcal{Y}(x)$ is the set of sets of $b$ elements chosen from $v$ possible elements, then $|\mathcal{Y}(x)| = \binom{v}{b}$. Moreover, since
$d(y,y') = \frac{1}{2b}(|y - y'| + |y' - y|)$ and since $R(x)$ is a uniform proposal distribution with support on $\mathcal{Y}(x)$, we have:

\begin{align*}
\mathbb{P}_{y'\sim R(x)}[d(y,y') = 1] &= \mathbb{P}_{y'\sim R(x)}[|y - y'| + |y' - y| = 2b] \\
&= 1 - \mathbb{P}_{y'\sim R(x)}[|y - y'| + |y' - y| < 2b] \\
&= 1 - \frac{1}{\binom{v}{b}} \sum_{\hat{y}\in\mathcal{Y}(x)} 1(|y - \hat{y}| + |\hat{y} - y| < 2b)
\end{align*}


= 1 

> 1 

= 1 

= 1 
> 1 





w (b-iy- 



J V— [ai;J 


([awj - 1)! 


1/2 


(20.a) 


(20.b) 


(20.c) 

(20.d) 


where the step in eq.(20.a) follows from the fact that for a fixed set $y$ of $b$ elements, if the set $\hat{y}$ has $b - i$ common
elements with $y$, then there are $\binom{v-b}{i}$ possible ways of choosing the remaining $i$ non-common elements in $\hat{y}$ from out
of $v - b$ possible elements. The step in eq.(20.b) follows from well-known inequalities for the binomial coefficient. The
step in eq.(20.c) follows from making $b = \lfloor \alpha v \rfloor$. The step in eq.(20.d) follows for any $\alpha \in [0,1/2]$. □


A.8 Proof of Claim vi

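Proof. Let $\Delta \equiv \phi(x,y) - \phi(x,y')$. We also introduce a superindex $p$ for the partitions. That is, for all $p \in \mathcal{P}(x)$, let
$\Delta^p \equiv \phi(x,y) - \phi(x,y')$ for some $y' \in \mathcal{Y}_p(x)$. By assumption, since $y' \in \mathcal{Y}_p(x)$ then $|\Delta^p_p| = b$ and $(\forall q \ne p)\ \Delta^p_q = 0$. Note
that $\|\Delta^p\|_1 = \sum_{q\in\mathcal{P}(x)} |\Delta^p_q| = |\Delta^p_p| = b$. Thus $|\Delta^p_p|/\|\Delta^p\|_1 = 1$ and $(\forall q \ne p)\ \Delta^p_q/\|\Delta^p\|_1 = 0$. Therefore:

\begin{align*}
\left\| \mathbb{E}_{y'\sim R(x)}[\mu(\Delta)] \right\|_2 &= \sqrt{ \sum_{q\in\mathcal{P}(x)} \left( \mathbb{E}_{y'\sim R(x)}\left[ \frac{\Delta_q}{\|\Delta\|_1} \right] \right)^2 } \\
&\le \sqrt{ \sum_{q\in\mathcal{P}(x)} \left( \mathbb{E}_{y'\sim R(x)}\left[ \frac{|\Delta_q|}{\|\Delta\|_1} \right] \right)^2 } \\
&= \sqrt{ \sum_{q\in\mathcal{P}(x)} \left( \sum_{p\in\mathcal{P}(x)} \mathbb{P}_{y'\sim R(x)}[y' \in \mathcal{Y}_p(x)]\, \frac{|\Delta^p_q|}{\|\Delta^p\|_1} \right)^2 } \\
&= \sqrt{ \sum_{q\in\mathcal{P}(x)} \frac{1}{|\mathcal{P}(x)|^2} } \\
&= \frac{\sqrt{|\mathcal{P}(x)|}}{|\mathcal{P}(x)|} \\
&= 1/\sqrt{|\mathcal{P}(x)|}
\end{align*}

where we used the fact that for a uniform proposal distribution $R(x)$, we have $\mathbb{P}_{y'\sim R(w,x)}[y' \in \mathcal{Y}_q(x)] = 1/|\mathcal{P}(x)|$. Finally,
since we assume that $n \le |\mathcal{P}(x)|/4$, we have $1/\sqrt{|\mathcal{P}(x)|} \le 1/(2\sqrt{n})$ and we prove our claim. □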
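The norm computed in this proof can be reproduced on a toy instance. The sketch below assumes that each partition $p$ is selected with probability $1/|\mathcal{P}(x)|$ and that the corresponding $\Delta^p$ has a single non-zero entry of magnitude $b$ at coordinate $p$, with $\mu(\Delta) = \Delta/\|\Delta\|_1$; the signs, sizes and function names are illustrative.

```python
import math

def mu(delta):
    """mu(Delta) = Delta / ||Delta||_1, with mu(0) = 0."""
    l1 = sum(abs(t) for t in delta)
    return [t / l1 for t in delta] if l1 > 0 else list(delta)

def expected_mu_norm(num_partitions, b, signs):
    """|| E[mu(Delta)] ||_2 when partition p is drawn uniformly at random."""
    expectation = [0.0] * num_partitions
    for p in range(num_partitions):
        delta_p = [0.0] * num_partitions
        delta_p[p] = signs[p] * b            # single non-zero entry of magnitude b
        for q, value in enumerate(mu(delta_p)):
            expectation[q] += value / num_partitions
    return math.sqrt(sum(e * e for e in expectation))

num_partitions, b = 16, 3
signs = [1 if p % 2 == 0 else -1 for p in range(num_partitions)]
print(expected_mu_norm(num_partitions, b, signs), 1.0 / math.sqrt(num_partitions))
```

With 16 partitions, the condition $n \le |\mathcal{P}(x)|/4$ allows $n \le 4$, for which $1/(2\sqrt{n}) \ge 1/4$.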
A.9 Proof of Claim vii

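Proof. Let $\Delta \equiv \phi(x,y) - \phi(x,y')$. By assumption $|\Delta_p| = b$ for all $p \in \mathcal{P}(x)$. Note that $\|\Delta\|_1 = \sum_{p\in\mathcal{P}(x)} |\Delta_p| = b\,|\mathcal{P}(x)|$.
Thus $|\Delta_p|/\|\Delta\|_1 = 1/|\mathcal{P}(x)|$ for all $p \in \mathcal{P}(x)$. Therefore:

\begin{align*}
\left\| \mathbb{E}_{y'\sim R(w,x)}[\mu(\Delta)] \right\|_2 &= \sqrt{ \sum_{p\in\mathcal{P}(x)} \left( \mathbb{E}_{y'\sim R(w,x)}\left[ \frac{\Delta_p}{\|\Delta\|_1} \right] \right)^2 } \\
&\le \sqrt{ \sum_{p\in\mathcal{P}(x)} \left( \mathbb{E}_{y'\sim R(w,x)}\left[ \frac{|\Delta_p|}{\|\Delta\|_1} \right] \right)^2 } \\
&= 1/\sqrt{|\mathcal{P}(x)|}
\end{align*}

Finally, since we assume that $n \le |\mathcal{P}(x)|/4$, we have $1/\sqrt{|\mathcal{P}(x)|} \le 1/(2\sqrt{n})$ and we prove our claim. □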

A.10 Proof of Claim viii

Proof. Algorithm 1 depends solely on the linear ordering induced by the parameter $w$ and the mapping $\phi(x,\cdot)$. That
is, at any point in time, Algorithm 1 executes comparisons of the form $\phi(x,y) \cdot w > \phi(x,\hat{y}) \cdot w$ for any two structured
outputs $y$ and $\hat{y}$. □

A.11 Proof of Claim ix

Proof. Algorithm 2 depends solely on the linear ordering induced by the parameter $w$ and the mapping $\phi(x,\cdot)$. That
is, at any point in time, Algorithm 2 executes comparisons of the form $\phi(x,y) \cdot w > \phi(x,\hat{y}) \cdot w$ for any two structured
outputs $y$ and $\hat{y}$. □